A202  666 


AFIT/GE/ENG/88D-36


MODIFIED BACKWARD ERROR PROPAGATION
FOR TACTICAL TARGET RECOGNITION
THESIS 

Charles  C.  Piazza 
Captain,  USAF 



Approved for public release; distribution unlimited


AFIT/GE/ENG/88D-36


MODIFIED BACKWARD ERROR PROPAGATION
FOR TACTICAL TARGET RECOGNITION


THESIS 

Presented  to  the  Faculty  of  the  School  of  Engineering 
of  the  Air  Force  Institute  of  Technology 
Air  University 

In  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of 
Master  of  Science  in  Electrical  Engineering 


Charles C. Piazza, B.S.
Captain,  USAF 


December 1988


Approved  for  public  release;  distribution  unlimited 


Acknowledgments 


There were many people behind the scenes of this research effort who played key roles in the development of my thesis. First and foremost is my [...] complete understanding and support, and her many words of encouragement, my potential was realized and personal goals achieved. I am truly indebted to Dr. Steven K. Rogers and Dr. Mark E. Oxley for their seemingly unlimited amount of knowledge, advice, and time afforded to me. Special thanks to Dr. Rogers for instilling within me his great enthusiasm for higher learning.

I would also like to thank Dr. Matthew Kabrisky for his assistance and input, and especially for his wit and wonderful view of the world in which we live. Many thanks to Captain Mike Roggemann for his efforts in providing me with the forward looking infrared imagery feature vectors, and for our many discussions.




Table  of  Contents 


Acknowledgments
List of Figures
List of Tables
Abstract
1. Introduction
1.1. Historical Background
1.2. Problem Statement
1.3. Scope
1.4. Approach and Methodology
1.5. Thesis Organization
2. Background Material
2.1. Introduction
2.2. Image Preprocessing
2.2.1. Moment Invariant Feature Vectors
2.2.2. Other Features
2.3. Introduction to Artificial Neural Networks
2.3.1. Multilayer Perceptron
2.3.2. Notation
2.3.3. Backward Error Propagation
2.3.4. Error Signals
2.3.5. First Order Minimization
2.4. Introduction to Second Order Minimization Techniques
2.4.1. Performance Surface
2.4.2. Second Order Differential Equation
2.4.3. Second Order Implementing Equations
2.5. Bayesian Classifier
2.5.1. Implementation
2.5.2. Addition of Bins
2.6. Summary
3. Second Order Algorithm and Network Convergence
3.1. Introduction
3.2. Definition of Network Performance
3.2.1. First Partial Derivative
3.2.2. Second Partial Derivative
3.2.3. Approximate Average Second Partial Derivative
3.3. Second Order Algorithm Development
3.4. Generalized Second Order Algorithm
3.4.1. Steepest Descent Algorithm
3.4.2. Momentum Algorithm
3.4.3. Additive Noise
3.4.4. Second Order Contributions
3.5. Final Implementation Stage
3.5.1. Forward Pass
3.5.2. Backward Pass
3.6. Network Convergence Considerations
3.7. Summary
4. Validation of Second Order Algorithm
4.1. Introduction
4.2. Exclusive OR Problem
4.2.1. Input Data and Network Parameters
4.2.2. Convergence Results
4.3. Classification of Doppler Imagery
4.3.1. Input Feature Data
4.3.2. Network Architecture and Learning Parameters
4.3.3. Classification Results
4.3.3.1. Average Classification Accuracy
4.3.3.2. Average Total Output Error
4.3.3.3. Target Accuracy
4.4. Summary
5. Classification of Forward Looking Infrared Imagery
5.1. Introduction
5.2. Target and Non-Target Feature Classification
5.2.1. Input Feature Data
5.2.2. Network Architecture and Learning Parameters
5.2.3. Classification Results
5.2.3.1. Instantaneous Classification Accuracy
5.2.3.2. Average Total Output Error
5.2.3.3. Neural Net Classifier Versus Bayesian Classifier
5.3. Moment Invariant Feature Classification
5.3.1. Input Feature Data
5.3.2. Network Architecture and Learning Parameters
5.3.3. Classification Results
5.3.3.1. Average Classification Accuracy
5.3.3.2. Average Total Output Error
5.4. Summary
6. Discussions, Recommendations, and Conclusions
6.1. Discussions
6.2. Recommendations
6.3. Conclusions
Appendix A: An Iterative Approach to Solving Linear Differential Equations
Appendix B: Linear Algebraic Forms and Notation
Appendix C: Partial Derivatives of the Sigmoid Function
Appendix D: Second Order Convergence Conditions for a Single Cell
Appendix E: Further Comparisons with the Bayesian Classifier
Appendix F: XOR Model
Appendix G: ADA Programming Model
Bibliography


List  of  Figures 


Figure
2.1 Typical Multilayer Perceptron Architecture
2.2 Functions of a Single Cell on Forward Pass
2.3 Sigmoid Transfer Function
2.4 Functions of a Single Cell on Backward Pass
2.5 Signal Flow Through Cell Using Second Order Implementation
2.6 Typical Discrete Conditional PDF
3.1 Signal Flow Through a Single Cell
3.2 Two Layer Network Display
4.1 XOR Network Architecture
4.2 Average Training Classification Accuracy for Gradient Method
4.3 Average Training Classification Accuracy for Momentum Method
4.4 Average Training Classification Accuracy for Second Order Method
4.5 Average Test Data Classification Accuracy for Gradient Method
4.6 Average Test Data Classification Accuracy for Momentum Method
4.7 Average Test Data Classification Accuracy for Second Order Method
4.8 Average Total Output Error Using Training Data for Gradient Method
4.9 Average Total Output Error Using Training Data for Momentum Method
4.10 Average Total Output Error Using Training Data for Second Order Method
4.11 Average Total Output Error Using Test Data for Gradient Method
4.12 Average Total Output Error Using Test Data for Momentum Method
4.13 Average Total Output Error Using Test Data for Second Order Method
5.1 Instantaneous Training Classification Accuracy for Gradient Method
5.2 Instantaneous Training Classification Accuracy for Momentum Method
5.3 Instantaneous Training Classification Accuracy for Second Order Method
5.4 Instantaneous Test Data Classification Accuracy for Gradient Method
5.5 Instantaneous Test Data Classification Accuracy for Momentum Method
5.6 Instantaneous Test Data Classification Accuracy for Second Order Method
5.7 Average Total Output Error Using Training Data for Gradient Method (one pass through network)
5.8 Average Total Output Error Using Training Data for Momentum Method (one pass through network)
5.9 Average Total Output Error Using Training Data for Second Order Method (one pass through network)
5.10 Average Training Classification Accuracy for Gradient Method
5.11 Average Training Classification Accuracy for Momentum Method
5.12 Average Training Classification Accuracy for Second Order Method
5.13 Comparisons of Gradient, Momentum, and Second Order Methods on Training Data
5.14 Average Test Data Classification Accuracy for Gradient Method
5.15 Average Test Data Classification Accuracy for Momentum Method
5.16 Average Test Data Classification Accuracy for Second Order Method
5.17 Comparisons of Gradient, Momentum, and Second Order Methods on Test Data
5.18 Average Total Output Error Using Training Data for Gradient Method
5.19 Average Total Output Error Using Training Data for Momentum Method
5.20 Average Total Output Error Using Training Data for Second Order Method
5.21 Average Total Output Error Using Test Data for Gradient Method
5.22 Average Total Output Error Using Test Data for Momentum Method
5.23 Average Total Output Error Using Test Data for Second Order Method
A.1 An Illustrative Graph of Newton's Method
C.1 A Single Cell
D.1 Single Cell and Decision Boundary: Pictorial Problem Description
D.2 Quadratic Function for d = 0
D.3 Quadratic Function for d = 1
D.4 a(f) for d = 0
D.5 a(f) for d = 1

List  of  Tables 


Table
4.1 Input Pattern Vectors and Desired Response for XOR
4.2 Learning Parameters for XOR
4.3 Comparison Between First and Second Order Techniques for XOR
4.4 Target Data Base for Classification of (doppler moment invariants)
4.5 Network Architecture Data
4.6 Network Training Data
4.7 Training Data Confusion Matrix for Gradient Method
4.8 Training Data Confusion Matrix for Momentum Method
4.9 Training Data Confusion Matrix for Second Order Method
4.10 Test Data Confusion Matrix for Gradient Method
4.11 Test Data Confusion Matrix for Momentum Method
4.12 Test Data Confusion Matrix for Second Order Method
5.1 Target and Non-Target Sample Breakdown (FLIR feature vectors)
5.2 Network Architecture Data
5.3 Network Training Data
5.4 Classification Accuracy of Neural Net Classifiers Versus the Bayesian Classifier
5.5 Target Data Base (FLIR moment invariants)
5.6 Network Architecture Data
5.7 Network Training Data
E.1 Overall Classification Accuracy


Abstract 


This thesis explores a new approach to the classification of tactical targets using a new biologically-based neural network. The targets of interest were generated from doppler imagery and forward looking infrared imagery, and consisted of tanks, trucks, armored personnel carriers, Jeeps, and petroleum, oil, and lubricant tankers. Each target was described by feature vectors, such as normalized moment invariants. The features were generated from the imagery using a segmenting process. These feature vectors were used as the input to a neural network classifier for tactical target recognition.

The neural network consisted of a multilayer perceptron architecture, employing a backward error propagation learning algorithm. The minimization technique used was an approximation to Newton's method. This second order algorithm is a generalized version of well known first order techniques, i.e., gradient of steepest descent and momentum methods. Classification using both first and second order techniques was performed, with comparisons drawn.


Modified  Backward  Error  Propagation 


for  Tactical  Target  Recognition 

1.  Introduction 

1.1.  Historical  Background 

In recent years there has been an enormous increase in interest in artificial neural networks (ANNs) across a variety of disciplines. One of the reasons behind this renewed interest is that ANNs may provide a solution to the problem of machine interpretation of image and voice patterns, a solution that has thus far eluded the digital computer. Therefore, it is no great wonder that ANNs have sparked the interest of scientific and engineering groups within the military community. From a military aspect, if machines can be taught to learn and recognize patterns, then it would be possible to realize an autonomous weapon system. A piloted aircraft could deliver autonomous weapon systems well outside enemy airspace, allowing each weapon system to seek out the target it was trained to destroy and minimizing the danger to the pilot.

1.2.  Problem  Statement 

The thesis problem is to classify tactical targets as viewed from forward looking infrared (FLIR) imagery and doppler imagery. The classifier to be used is a computer simulation of an ANN.

1.3. Scope

The targets of interest to be classified were from a tactical scenario. Doppler and FLIR imagery in raw form must be preprocessed before being submitted as the input to an ANN. Much of this preprocessing is beyond the scope of this thesis effort; however, when deemed necessary the reader will be directed to the applicable reference. The targets to be classified from the doppler imagery consisted of M60 tanks, Petroleum, Oil and Lubricant (POL) tankers, Jeeps, and 2.5 ton trucks [12]. Targets extracted from the FLIR imagery consisted of M551 tanks, 2.5 ton flatbed trucks, M113 Armored Personnel Carriers (APCs), and CJ-5 Jeeps [10].

The ANN architecture used for this study was the multilayer perceptron described by Richard P. Lippmann [4:15-18]. Back propagation techniques will be used for updating the network weights. The minimization algorithms used were first and second order backward error propagation methods. The second order algorithm is a generalized version of the first order algorithm and is also an approximation to Newton's method, derived by David B. Parker [7:593-600; 8].

1.4.  Approach  and  Methodology 

The second order back propagation network required validation before being tested and used as a classifier. Therefore, before addressing the pattern classification problem, the network will be tested on the exclusive OR (XOR) problem. If the network can solve the XOR problem, then it may be possible to apply the network to the more difficult task of pattern classification.

Next, if the potential exists for pattern classification, it would be helpful to have a training set of feature vectors which have already been classified with a neural network. Such is the case with the doppler imagery. Dennis Ruck [12] trained a network using an algorithm provided by Richard P. Lippmann [4:17], and using moment invariants extracted from the doppler imagery. The algorithm was a first order, steepest descent search technique applying a momentum term. An important result of Ruck's study, for this thesis effort, was that the network achieved near perfect classification of the training set. Therefore, the doppler imagery will play an important role during the network validation stage.

A comparison between the first order and second order techniques using this data will follow. Classification accuracy of the training set and the test data set will be measured against the number of iterations. Moment invariants from the same imagery as the training set, but never before seen by the network, will make up the test data set. Also, plots of log error versus the log of the number of iterations will be generated for comparison.

The final task was classification of features generated from the FLIR imagery. Various other features, as well as the moment invariants generated from the FLIR imagery, will be considered for classification. A portion of the features will be used for classification and comparison with a Bayesian classifier implemented by Mike Roggemann [10]. The task will consist of training the network with a known training set and measuring the classification accuracy once the network has been trained. Again, classification accuracy will be measured using the training set and test set. First and second order methods will be used in comparison with the Bayesian classifier.

To  conclude  the  section  on  classification  of  the  FLIR 
imagery  targets,  the  moment  invariants  will  be  considered 
explicitly  for  classification.  Similar  comparisons  will  be  drawn 
as  described  above  for  the  doppler  imagery,  for  both  the  first 
and  second  order  techniques. 

1.5.  Thesis  Organization 

This chapter served as the introduction to the thesis effort undertaken. Chapter two provides the background material underlying the algorithms developed in chapter three. Chapter four consists of the validation stage for the second order back propagation model. Chapter five contains the results obtained during classification of the features generated from the FLIR imagery. Conclusions, recommendations, and discussions of the results follow in chapter six to conclude the thesis.

2. Background Material

2.1. Introduction

This chapter begins with a brief and limited discussion on preprocessing the doppler imagery and forward looking infrared (FLIR) imagery. Next, an introduction to artificial neural networks (ANNs) follows, along with a brief discussion on the current use of ANNs as classifiers. The network architecture, learning algorithm, and minimization algorithms used for this study will be included. The following section highlights the significant steps in Parker's approximation to Newton's method [6], which in turn will be followed by the equations used to implement this approximation. The final section includes a discussion of the Bayesian classifier.


This investigation required a review of the methods and approximations used in solving differential equations and their discrete counterparts, the difference equation. Therefore appendix A has been reserved for a review in these areas. The algorithms involved are also quite heavily dependent on linear algebraic forms, so appendix B has been reserved for such discussions, along with any accompanying notation.

2.2.  Image  Preprocessing 

2.2.1. Moment Invariant Feature Vectors

As mentioned in section 1.3, objects of interest, the targets, must be preprocessed before being applied to an ANN. The targets must be extracted from the raw doppler and FLIR imagery. This process of extraction is known as segmentation. Dennis Ruck [12] describes the segmentation of the doppler imagery, whereas Mike Roggemann [10] used a variation of the techniques described by Azriel Rosenfeld [11:6£-73] for segmenting the objects from the FLIR imagery.

Once the targets have been segmented, a set of features is described to provide shape discrimination between targets. Both sets of images used moment invariants for shape description. Ruck [13] describes the technique for shaping the doppler imagery, while Roggemann [10] used a technique described by Ming-Kuei Hu [2:179-187].

The final preprocessing concerns normalizing the moment invariants. The moments had to be normalized to ensure that the large valued moments did not bias the decision making ability of the classifier. Very large values could influence the network in the wrong direction. Therefore both data sets were normalized to have a mean vector of zero and a standard deviation vector of one. This was accomplished by first computing the mean and standard deviation of each feature over the entire training set [1:99-105]. The $j^{th}$ component of each feature vector in the training set, $x_j$, was transformed by the following equation:

$$y_j = \frac{x_j - m_j}{\sigma_j}, \qquad (2.1)$$

where $m_j$ and $\sigma_j$ are the mean and standard deviation of feature j, respectively [1:99-105]. This provides a mean vector of 0 and a standard deviation vector of 1, and each feature is now scaled identically [12].
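As a minimal sketch of the normalization in equation 2.1, the following Python fragment computes the per-feature mean and standard deviation over a training set and rescales each feature vector. The function names and the toy data are illustrative assumptions and do not come from the thesis. Dividing by N (the population standard deviation) is also an assumption, since the text does not say which estimator was used.

```python
import math

def fit_normalizer(training_set):
    """Return per-feature means and standard deviations of the training set."""
    n_samples = len(training_set)
    n_features = len(training_set[0])
    means = [sum(x[j] for x in training_set) / n_samples
             for j in range(n_features)]
    # Assumption: population standard deviation (divide by N, not N - 1).
    stds = [
        math.sqrt(sum((x[j] - means[j]) ** 2 for x in training_set) / n_samples)
        for j in range(n_features)
    ]
    return means, stds

def normalize(vector, means, stds):
    """Apply y_j = (x_j - m_j) / sigma_j to one feature vector."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

# Made-up feature vectors whose two features differ greatly in magnitude.
training = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
means, stds = fit_normalizer(training)
normalized = [normalize(x, means, stds) for x in training]
```

After this step both features have zero mean and unit standard deviation over the toy training set, so neither dominates purely because of its scale.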

As alluded to in the previous paragraph, the normalized moment invariants are basically a set of features describing the target of interest. Therefore, an n-dimensional vector or feature vector describes and discriminates the targets of interest, where n represents the number of moment invariants. From here on these moment invariants will be referred to as feature vectors, representing the targets of interest. Each target, known as a class, will have many examples of feature vectors describing it.

2.2.2. Other Features

There were other features worthy of classification within the Roggemann FLIR imagery data set [10]. Roggemann performed classification with a decision rule of target (TGT) or non-target (NT) using his implementation of a Bayesian classifier. An identical classification will be performed with the ANN classifier studied in this effort for comparison. Target features extracted from the imagery considered tanks, trucks, APCs, and CJ-5 Jeeps. Each image of a tactical scenario consisted of TGT blobs and NT blobs. Features extracted from these blobs and considered for this study were the length to width ratio of each blob, the blob mean intensity minus the background mean intensity, and the blob standard deviation of the intensity.

2.3. Introduction to Artificial Neural Networks

An ANN is basically a computing system usually consisting of many processing elements densely interconnected via interconnection weights. These processing elements are commonly referred to as nodes or cells. The construction of the network, the way in which the nodes are connected, is known as the network architecture.

Many  architectures  exist  in  the  literature  and  each  is 
highly  dependent  on  the  application  [4:4-22].  In  this  study  the 
ANN  will  be  used  as  a  classifier.  In  general,  classifiers  can 
perform  three  different  tasks,  as  described  by  Lippmann  [4:6]. 
First,  they  can  identify  which  class  best  represents  an  input 
pattern,  when  the  input  has  been  corrupted  by  noise.  Secondly, 
they  can  be  used  as  an  associative  content-addressable  memory. 
In  this  application,  part  of  an  input  is  available  and  the 
complete  input  pattern  is  desired.  Such  an  application  could  be 
found in the decoding of information signals. The third task involves vector quantization. The idea is to map an n-dimensional input vector into an m-dimensional output vector, where usually m < n.

This thesis effort involves the identification of a class which best represents an input. The feature vectors discussed above will be used as the input patterns fed to the ANN. The ANN used in this study will project an n-dimensional feature vector to an m-dimensional output vector. This should not be confused with vector quantization. The resultant output describes the predetermined class from where the input originated. Hence, the term "supervised" network. The network architecture used to perform this task is discussed in the following section.

2.3.1.  Multilayer  Perceptron 

A common architecture used for pattern classification applications is the multilayer perceptron [4:15-18], see Fig. 2.1. The multilayer perceptron consists of one or more hidden layers, where each node of each layer is connected to each node in the layer above it. This implies that each node is a multi-input, single-output element.

Figure 2.1 Typical Three Layer Multilayer Perceptron Architecture. The layer outputs labeled in the figure are:

$$f^{3}_{out,j} = f\Big(\sum_{i=0}^{N_2-1} w^{3}_{ij}\, f^{2}_{out,i} + \theta^{3}_{j}\Big), \qquad 0 \le j \le N_3 - 1$$

$$f^{2}_{out,j} = f\Big(\sum_{i=0}^{N_1-1} w^{2}_{ij}\, f^{1}_{out,i} + \theta^{2}_{j}\Big), \qquad 0 \le j \le N_2 - 1$$

$$f^{1}_{out,j} = f\Big(\sum_{i=0}^{n-1} w^{1}_{ij}\, f_{in,i} + \theta^{1}_{j}\Big), \qquad 0 \le j \le N_1 - 1$$

In Fig. 2.1, the numerical superscript notation denotes the parameter associated with its corresponding layer. The letter i denotes the index of an input to an arbitrary cell, while j denotes the index of a cell within a layer. In the following section, emphasis is placed on notation for clarity.
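The layer equations of Fig. 2.1 can be sketched in Python as a loop that feeds each layer's outputs into the layer above. This is an illustrative sketch only: the function names, the list-of-lists weight layout (`weights[i][j]` connects input i to node j), and the all-zero example weights are assumptions, and the transfer function f is taken to be the sigmoid that the text introduces in section 2.3.3.

```python
import math

def sigmoid(a):
    """Sigmoid transfer function, assumed here for f."""
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, weights, thresholds, f=sigmoid):
    """Compute f_out[j] = f(sum_i weights[i][j] * inputs[i] + thresholds[j])."""
    outputs = []
    for j in range(len(thresholds)):
        activation = sum(weights[i][j] * inputs[i] for i in range(len(inputs)))
        outputs.append(f(activation + thresholds[j]))
    return outputs

def network_forward(inputs, layers):
    """Feed the input vector upward through a list of (weights, thresholds)."""
    signal = inputs
    for weights, thresholds in layers:
        signal = layer_forward(signal, weights, thresholds)
    return signal

# Hypothetical network with n = 2 inputs and N1 = 2, N2 = 2, N3 = 1 nodes.
# All weights and thresholds are zero, so every node outputs f(0) = 0.5.
zero_layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 nodes
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # layer 2: 2 nodes -> 2 nodes
    ([[0.0], [0.0]], [0.0]),                 # layer 3: 2 nodes -> 1 node
]
output = network_forward([0.3, 0.7], zero_layers)
```

Each node reads every output of the layer below and produces a single value, which matches the multi-input, single-output description above.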

2.3.2. Notation

Let $f_{out,j}(t)$ denote the output of the $j^{th}$ node of a given layer in the network at time t. Furthermore, let $\mathbf{f}_{in}(t)$ denote the pattern of inputs to that node. Note that all bold face characters denote vectors. For example, $\mathbf{f}_{in}(t)$ is a vector whose components are

$$\mathbf{f}_{in}(t) = [\, f_{in,1}(t) \;\; f_{in,2}(t) \;\; \cdots \;\; f_{in,q}(t) \,]^{T}.$$

Inputs are either the outputs of nodes from the previous layer or information from the environment, as shown in Fig. 2.1. The interconnection weight $w_{ij}$ connects the output of the $i^{th}$ node in the previous layer to the $j^{th}$ node of the following layer. Therefore, $\mathbf{w}$ is the weight matrix for a given layer.

Although not shown in Fig. 2.1, each node will also receive a number of error signals back propagating from the layer immediately above it. These error signals make up a vector denoted as

$$\mathbf{e}_{in}(t) = [\, e_{in,1}(t) \;\; e_{in,2}(t) \;\; \cdots \;\; e_{in,r}(t) \,]^{T}.$$

Just as a node receives a number of weighted inputs to produce an output, the node will receive a number of weighted error signals. The weighted error signals are summed by each node, and this total error ($e_{tot}$) is used for updating the nodal weights and back propagating to the lower layers. The total error is

$$e_{tot}(t) = \sum_{i=1}^{r} e_{in,i}(t).$$

The output error of the node is denoted as $e_{out,j}(t)$ and is defined below in section 2.4.3 and discussed in chapter three.

The  algorithm  used  in  this  study  requires  the  use  of  time 
derivatives  of  the  above  quantities.  Therefore  primed  (') 
quantities  will  denote  the  time  derivative. 

2.3.3.  Backward  Error  Propagation 

Within the confines of this thesis, backward error propagation (BEP), or backprop, refers to an entire supervised learning algorithm [16:265]. This algorithm will be defined with a sigmoidal transfer function, a square error function, and a weight update rule to be defined in section 2.4.

The multilayer perceptron in Fig. 2.1 is an example of a backprop network, as introduced by Lippmann [4:15-18]. Input signals enter the bottom of the network and exit the top as output signals. The output signals are computed as functions of the inputs to the node and interconnecting weights. From this output, an error signal is computed and re-enters the top of the network propagating backwards; hence the name backprop. Each node within a given layer contains a set of weights that the cell must adjust to minimize the error signals.

Lippmann's first order minimization method uses the backprop learning algorithm. This algorithm computes the partial derivatives of the square error function with respect to the weights of the network. It uses these partial derivatives to update the weights.

When computing an output, each node within the network described by Lippmann performs two functions [4:17], as shown in Fig. 2.2.

Figure 2.2 Functions of a Single Cell on Forward Pass (diagram labels: INPUT, OUTPUT)

First, it computes a weighted sum of all its inputs, the activation

$$a_j(t) = \sum_{i} w_{ij}(t)\, f_{in,i}(t) + \theta_j. \qquad (2.2)$$

The symbol $\theta_j$ is the threshold level of the $j^{th}$ node. The threshold is no more than a weight, with a corresponding constant input normally equal to 1. Hence the name threshold, which is sometimes referred to as an offset. Secondly, it passes this weighted sum through a sigmoidal transfer function, where the output of the nonlinearity is the output of the node. The nonlinearity most commonly used for problems associated with the multilayer perceptron is a sigmoid function,

$$f_{out,j} = \frac{1}{1 + \exp(-a_j)}. \qquad (2.3)$$

As shown in Fig. 2.3, the output of each node is continuous between 0 and 1. Parker [6] refers to this process as the forward pass.

Figure 2.3 Sigmoid Transfer Function
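The remark that the threshold is just a weight on a constant input of 1 can be checked with a short Python sketch of equations 2.2 and 2.3. The function names and numeric values below are illustrative assumptions; the two activation routines are meant to show that both views give the same number.

```python
import math

def activation(inputs, weights, theta):
    """Equation 2.2: weighted sum of the inputs plus the threshold."""
    return sum(w * x for w, x in zip(weights, inputs)) + theta

def activation_augmented(inputs, weights, theta):
    """Same quantity, with theta treated as a weight on a constant input of 1."""
    return sum(w * x for w, x in zip(weights + [theta], inputs + [1.0]))

def sigmoid(a):
    """Equation 2.3: squashes the activation into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

# Illustrative values only.
inputs = [0.5, -1.0]
weights = [2.0, 1.0]
theta = 0.25

a_direct = activation(inputs, weights, theta)
a_augmented = activation_augmented(inputs, weights, theta)
node_output = sigmoid(a_direct)
```

Treating the threshold as one more weight means a single update rule can, in principle, adjust thresholds and interconnection weights alike.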

2.3.4. Error Signals

Once the output of the network is determined, an error signal is computed to measure the performance of the network. The error signal is back propagated to all the layers. The weights of each node are adjusted using the error signals that it receives from all of the nodes which receive its output. Parker [6] refers to this process as the backward pass, and Fig. 2.4 depicts the function of the cell on the backward pass.


Figure 2.4  Functions of Single Cell on Backward Pass (ERROR SIGNAL IN, ERROR SIGNAL OUT)


In  the  following  paragraphs  a  minimization  algorithm  or 
weight  update  rule  is  provided.  The  algorithm  describes  the 
method  in  which  the  node  uses  the  error  signals  to  update  its 
weights, as described by Lippmann [4:17].

2.3.5.  First  Order  Minimization 

Lippmann's first order minimization technique assumes a
squared error function to minimize [4:17]. The weight update
rule uses the first partial of the squared error function with
respect to the weights of each cell. Performing this partial
yields the following weight update rule for an arbitrary output
layer node:

\[ w_{ij}(t+1) = w_{ij}(t) + \eta\, \delta_j\, f_{in,i}(t). \]

The symbol η controls the rate of convergence, while α is a
momentum scalar that weights the change from the previous update
cycle. The error signal δ_j for the output node has the
following form:

\[ \delta_j = f_{out,j}\,(1 - f_{out,j})\,(d_j - f_{out,j}), \]

where d_j denotes the desired output of the output node. The
desired value is commonly set to 1 or 0, and only one output node
is allowed high at a time. The error signal for the internal
layer nodes is given by

\[ \delta_j = f_{out,j}\,(1 - f_{out,j}) \sum_k \delta_k\, w_{jk}. \]

The error signal is a weighted summation over all nodes in the
next higher layer. Keep in mind that the output (f_{out,j}) in the
above equation now pertains to the corresponding hidden layer
cell.

The above update equations follow the gradient of steepest
descent in an iterative fashion. An input enters the bottom of
the network and an output is computed at the top. The partial
derivative of the squared error function is computed and back
propagated as an error signal. This error signal is in turn used
in updating the previous weight value of an arbitrary cell. This
iterative process is continued until the minimum of the squared
error function is found, indicating optimum weight values.

The  momentum  term  has  the  effect  of  smoothing  the  squared 
error  surface.  It  provides  more  information  on  the  current 
update  cycle  by  adding  a  weighted  change  from  the  previous  cycle. 
The  momentum  term  pushes  the  change  in  weights  further  in  the 
direction  of  the  previous  update.  Appendix  D  considers  the 
momentum  term  in  more  detail. 
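
A short Python sketch may help tie the error signals and the weight update together. It is illustrative only: the function names are invented for the example, and the momentum term is written in the commonly used form alpha * (w(t) - w(t-1)), which is an assumption here because the update rule in this section is stated without an explicit momentum term.

```python
def output_delta(f_out, desired):
    """Error signal of an output node: f(1 - f)(d - f)."""
    return f_out * (1.0 - f_out) * (desired - f_out)

def hidden_delta(f_out, deltas_above, weights_to_above):
    """Error signal of a hidden node: f(1 - f) times the weighted sum of
    the error signals coming back from the next higher layer."""
    back = sum(d * w for d, w in zip(deltas_above, weights_to_above))
    return f_out * (1.0 - f_out) * back

def update_weights(weights, prev_weights, inputs, delta, eta=0.5, alpha=0.0):
    """First order update with an assumed momentum term:
    w(t+1) = w(t) + eta * delta * f_in + alpha * (w(t) - w(t-1))."""
    return [w + eta * delta * x + alpha * (w - wp)
            for w, wp, x in zip(weights, prev_weights, inputs)]

d_out = output_delta(0.5, 1.0)               # 0.5 * 0.5 * 0.5
d_hid = hidden_delta(0.5, [d_out], [2.0])    # 0.25 * (0.125 * 2.0)
new_w = update_weights([1.0], [0.8], [1.0], d_out, eta=0.5, alpha=0.9)
```

With alpha set to zero the update reduces to plain steepest descent. A nonzero alpha carries part of the previous weight change into the current step.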

2.4.  Introduction  to  Second  Order  Minimization  Techniques 

In  order  for  classification  to  take  place,  there  must  be 
some  learning  rule  the  network  uses  to  minimize  the  error 
associated  with  the  decisions  it  must  make.  Therefore  the 
learning  rule  applies  some  minimization  technique.  Dennis  Ruck 
[12]  implemented  a  multilayer  perceptron  with  a  backprop  learning 
rule  that  applied  the  momentum  method,  as  provided  by  Lippmann 
[4:17]  and  discussed  in  the  previous  section.  The  momentum 
method, developed by Rumelhart, Hinton and Williams [13], is a
variation of the steepest descent method developed by Werbos [15].
Both methods are first order methods because they only involve
the first derivative of the quantity being minimized. A second
order algorithm makes use of the second derivatives.

To  understand  the  difference  between  first  and  second  order 
techniques, Parker draws upon a simple, but quite effective
analogy  which  is  quoted  below. 

"Imagine  that  you  are  at  the  top  of  a  ridge.  Below  you 
is  a  long,  narrow  valley  that  slopes  gently  down  to 
your  right.  Far  off  in  the  valley  to  the  right  is  the 
ski lodge, to which you wish to return. One way to
get to the lodge is to simply sit on your skis and let
gravity move you. You will zip quickly down the slope
till you hit the valley, but once in the valley you
will coast very, very slowly till you reach the lodge"
[6].

This is equivalent to first order techniques which follow the
gradient of steepest descent. Fast convergence down the slope may
give way to very slow convergence in a valley. On the other hand,
consider an alternate path.

"A  better  way  to  get  back  to  the  lodge  is  to  slightly 
drag one of your skis so that you cut across the slope,
maintaining a constant speed till you hit the lodge.
This is equivalent to a second order algorithm, which
has a constant convergence rate under appropriate
conditions" [8].

The algorithm discussed in the following paragraphs is an
approximation developed by David Parker [7:593-600; 8] to the
second order Newton's method. This algorithm is a more general
case of the steepest descent and momentum methods. By adjusting
the learning parameters correctly the algorithm can be made to
perform as the steepest descent or momentum method. Below is a
brief presentation of the algorithm; for a more thorough
explanation see [8] and appendix A.
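
The ski slope analogy can be reproduced numerically on a long, narrow quadratic valley. The sketch below compares steepest descent with an exact Newton step on a surface with a diagonal second derivative matrix. It is not Parker's algorithm, only an illustration of the first order versus second order distinction; the curvatures, step size, and function names are all assumptions chosen for the example.

```python
def gradient(point, curvatures):
    """Gradient of s = 0.5 * sum(c_i * x_i^2)."""
    return [c * x for c, x in zip(curvatures, point)]

def steepest_descent(point, curvatures, eta, steps):
    """First order method: move against the gradient with a fixed step."""
    for _ in range(steps):
        g = gradient(point, curvatures)
        point = [x - eta * gi for x, gi in zip(point, g)]
    return point

def newton(point, curvatures, steps):
    """Second order method: scale the gradient by the inverse second
    derivative (the matrix is diagonal here, so inversion is a division)."""
    for _ in range(steps):
        g = gradient(point, curvatures)
        point = [x - gi / c for x, gi, c in zip(point, g, curvatures)]
    return point

curvatures = [1.0, 100.0]      # gentle slope along x, steep walls along y
start = [10.0, 1.0]
gd = steepest_descent(start, curvatures, eta=0.019, steps=50)
nt = newton(start, curvatures, steps=1)
```

The step size for steepest descent is limited by the steep direction, so progress along the gentle valley floor is slow. The Newton step rescales each direction by its own curvature and reaches the minimum of this quadratic in a single step.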

2.4.1.  Performance  surface 

For  now,  the  quantity  being  minimized  for  this  study  is  an 
independent  variable  of  some  performance  function  of  the  network. 
Parker denotes the instantaneous performance of a network by
s( f_in(t), w(t) ). The instantaneous performance is dependent on
the current set of inputs and also the current set of weights.
However, in general, the derivation begins by defining an average
instantaneous performance, which depends only on the weights of
the network. The average instantaneous performance is given by,

\[ \mathrm{avg}(\, s(w(t)) \,) = \rho \int_{-\infty}^{t} s(\, f_{in}(\tau),\, w(t) \,)\, e^{-\rho (t - \tau)}\, d\tau. \tag{2.4} \]

Parker notes that the scalar quantity ρ is roughly the inverse of
the amount of time the average is considered. Basically, the
instantaneous performance will be exponentially weighted over all
input patterns f_in(τ) for the weights w(t) at time t. In
other words, the average performance provides a certain amount of
memory from information of past inputs. The exponential term
ensures that emphasis is placed on the most current inputs [8].
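
A discrete version of the exponentially weighted average makes the role of ρ concrete. The sketch below replaces the integral with a sum over a stored history of performance values, all evaluated with the current weights. The function name, the sampling interval dt, and the sample values are assumptions for illustration only.

```python
import math

def average_performance(perf_history, rho, dt=1.0):
    """Discrete approximation of the exponentially weighted average.

    perf_history[j] holds s(f_in(tau_j), w(t)), oldest first and newest
    last, so the newest sample has age zero and the largest weight.
    """
    n = len(perf_history)
    total = 0.0
    for j, s in enumerate(perf_history):
        age = (n - 1 - j) * dt
        total += rho * s * math.exp(-rho * age) * dt
    return total

steady = average_performance([1.0] * 500, rho=0.1)
old_error = average_performance([1.0, 0.0], rho=0.5)     # error occurred earlier
recent_error = average_performance([0.0, 1.0], rho=0.5)  # error is current
```

A constant performance value averages to roughly that same value, while an error on the most recent input contributes more than the same error on an older input.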

A  graph  of  avg  (  s(w)  )  as  a  function  of  w(t)  would  define  a 
performance  surface  at  a  fixed  time  t.  Thus,  the  performance 
surface  changes  over  time  with  each  new  set  of  inputs.  According 
to Parker, the task of the backprop network is to find the lowest
point on the performance surface, and then follow that point as
the surface changes with time [8].

2.4.2.  Second  Order  Differential  Equation 

Parker's  derivation  of  the  second  order  differential 
equation  is  very  thorough  and  well  explained.  Therefore,  no 
attempt  will  be  made  to  duplicate  his  work  in  full.  However,  it 
will  be  time  well  spent  to  highlight  the  significant  intermediate 
equations, as well as the final result. See [8] and Appendix A
for  further  study. 

Parker  derives  the  algorithm  with  an  objective  of 
optimality  in  mind.  Assuming  the  weights  have  converged  to  a 
minimum  of  the  performance  surface,  then  as  the  performance 
surface  changes  with  time,  the  weights  should  follow  this 
minimum.  The  first  step  involves  the  derivation  of  Newton's 
method  from  an  optimality  criterion.  The  goal  is  to  find  the 
minimum  of  the  performance  surface  by  updating  the  weights.  So 
the first step is to take the derivative of both sides of Eq. 2.4
with respect to the weights. Since the performance surface is
changing with time, it is desired to have the weights follow the
minimum as time changes. This requires taking the time
derivative of both sides. By doing so Eq. 2.4 is transformed to
the following (see appendix A):


where the functional dependencies on t have been suppressed for
convenience. The star (*) notation denotes
the optimal value of the weights (w*). The following
relationship  can  be  made  since  the  network  performance  is  a 
function  of  time  through  the  weights: 


As Parker points out, the explicit first order differential
equation of Eq. 2.5, known as Newton's method, is valid only if
the average second derivative matrix is invertible (it is not)
[7:593-600; 8]. Actually computing the determinant of the
time average second derivative matrix reveals the matrix to be
singular and thus not invertible. This is shown in appendix D.
Regardless, inverting this matrix is entirely too expensive.
For n weights in the network, the number of operations
performed is a function of n^3, or O(n^3). The reason behind the
potentially enormous number of operations is that each component
of the matrix must be computed. This entails computing the
average second partial with respect to every combination of
weights, followed by inverting the matrix for each cell in the
network. These computations must be made before the weights are
updated. A very unpleasant thought! This task has been avoided
by other researchers using quasi-Newton methods, reducing the
number of operations to O(n^2).
Parker, on the other hand, chose a different route.
Rewriting Eq. 2.5 as


avg 


d^s 

aw^dw-T 


dw» 

at 


2.  6 


an iterative approach is applied to obtain a close approximation
to the time derivative of w* [8]. Appendix A describes an
iterative approach in general. Following this approach yields a
second order differential equation for an optimal path for the
weights:


a^w* 

at2 


as 


B*avg 


a^s 

aw*aw*T 


^  aw* 
,  at 


2.  7 


where p controls the convergence of the algorithm. The symbol
ŵ denotes the approximate values of w*. The derivation could be
stopped here with an attempt to implement Eq. 2.7. However,
Parker chooses to continue since Eq. 2.7 is only an approximation
for the optimal path [8].

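According to Parker, leakage terms are required to
guarantee convergence [7:593-600; 8]; see appendix A. The final
version of the algorithm used for classification in this thesis
is given by,

\[ \frac{\partial^2 w}{\partial t^2} = -a_1 \frac{\partial s}{\partial w} - \left[ a_2 I + a_3\, \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \right] w - \left[ a_4 I + a_5\, \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \right] \frac{\partial w}{\partial t}. \tag{2.8} \]

The matrix I is the identity matrix, and is introduced to ensure
that matrices are added to matrices. The constants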

ag  :  1 1  *  ^  2* 

a3  : 

aif  :  1  ^  *  1  g,  and 
as  :  0 

are learning parameters and usually small positive numbers. The
constants l_1 and l_2 are the leakage terms introduced by Parker
[8]; see appendix A.

2.4.3.  Second Order Implementing Equations

Implementing the algorithm of Eq. 2.8 is not a straightforward
exercise. Chapter three will derive the implementing
equations  in  detail.  There  are  several  ways  to  implement  the 
algorithm  and  the  implementing  equations  Parker  uses  are  listed 
below.  The  equations  describe  a  forward  sweep  and  backward  sweep 
through  the  network.  On  the  forward  sweep  each  cell  computes  its 
own  copy  of  the  following: 


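\[ f_{out,k} = f_{out}(f_{in,k},\, w_k), \]

\[ w'_k = a_3\, \Delta t\, w_k + a_5\, \Delta w_k, \]

\[ f'_{out,k} = \frac{\partial f_{out}}{\partial f_{in}^T}\Big|_k f'_{in,k} + \frac{\partial f_{out}}{\partial w^T}\Big|_k w'_k. \]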


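The calculations for the backward sweep are as follows:

\[ e_{tot} = \sum_i e_{in,i} = 1^T e_{in}, \]

\[ e'_{tot} = \sum_i e'_{in,i} = 1^T e'_{in}, \]

\[ e_{out} = e_{tot}\, \frac{\partial f_{out}}{\partial f_{in}}\Big|_k, \]

\[ e'_{out} = e'_{tot}\, \frac{\partial f_{out}}{\partial f_{in}}\Big|_k + e_{tot} \left( \frac{\partial^2 f_{out}}{\partial f_{in}\, \partial w^T}\Big|_k w'_k + \frac{\partial^2 f_{out}}{\partial f_{in}\, \partial f_{in}^T}\Big|_k f'_{in,k} \right), \]

\[
\begin{aligned}
\Delta w_{k+1} ={}& \Delta w_k - a_2\, \Delta t^2\, w_k - a_4\, \Delta t\, \Delta w_k
 + \left( a_1\, \Delta t^2\, e_{tot} + \Delta t\, e'_{tot} \right) \frac{\partial f_{out}}{\partial w}\Big|_k \\
& + \Delta t\, e_{tot} \left( \frac{\partial^2 f_{out}}{\partial w\, \partial w^T}\Big|_k w'_k + \frac{\partial^2 f_{out}}{\partial w\, \partial f_{in}^T}\Big|_k f'_{in,k} \right),
\end{aligned}
\]

\[ w_{k+1} = w_k + \Delta w_{k+1}. \]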

The  above  equations  describe  a  discrete  implementation  of 
Eq. 2.7 for a single cell, where k denotes the discrete time
step [8]. The implementation may be simulated with a computer
program.  They  are  listed  here  for  those  readers  who  wish  to  skip 
the  detailed  implementation  stage  discussed  in  chapter  three. 
Figure  2.5  below  demonstrates  how  the  cell  varies  from  Figs.  2.2 
and  2.4  in  the  amount  of  information  it  must  process. 


Figure 2.5  Signal Flow Through Cell Using Second Order
Implementation [8]
Later, in chapter five, the results obtained from a Bayesian
classifier will be compared to the results obtained from an ANN
classifier for a given set of features. In light of this, the
text below provides an introduction to the concept of a Bayesian
classifier. In particular, the approach Roggemann used in his
implementation of a Bayesian classifier will be discussed [10].
The discussion begins with a statement of Bayes rule, followed by
its application in the Bayes classifier.

2.5.1.  Implementation 

Recall that Bayes rule is defined in the following way:

\[ p[A/B] = \frac{p[A, B]}{p[B]} \]

and

\[ p[B/A] = \frac{p[A, B]}{p[A]} \]

such that

\[ p[B/A] = \frac{p[A/B]\, p[B]}{p[A]}. \]

The probability of the occurrence of B given A (p[B/A]) is equal
to the product of the probability of A given B (p[A/B]) and the
probability of B (p[B]), divided by the probability of A (p[A]).

For the Bayes classifier, the idea is to determine the
probability of the occurrence of a target (TGT) given some
feature (F) describing the target, or p[TGT/F]. Another decision
the classifier must make is to determine the probability of a
non-target (NT) given a feature, or p[NT/F]. Each of these
conditionals is determined from the probability of occurrence of
a feature given a TGT or a NT. Hence, the known conditionals
are in the form of p[F/TGT] and p[F/NT]. Thus,
Roggemann's implementation considered classifications of target
(TGT) and non-target (NT) only. In other words, by applying
Bayes rule it is desired to compute the following:

\[ p[TGT/F] = \frac{p[F/TGT]\, p[TGT]}{p[F]} \]

and

\[ p[NT/F] = \frac{p[F/NT]\, p[NT]}{p[F]}. \]

The value of the probability of a feature is given as:

\[ p[F] = p[F/TGT]\, p[TGT] + p[F/NT]\, p[NT]. \]

Applying the "Principle of Indifference", the a priori
probabilities p[TGT] and p[NT] are both set equal to 0.5 [5:1-53].

To consider multiple features (F_1, F_2, ..., F_n), it is
necessary to impose conditional independence on the conditionals
for multiple feature decisions, such that

\[ p[F_1, F_2, \ldots, F_n / TGT] = p[F_1/TGT]\, p[F_2/TGT] \cdots p[F_n/TGT] = \prod_{i=1}^{n} p[F_i/TGT]. \]

Applying Bayes rule while considering the above alterations
provides:

\[ p[TGT/F_1, \ldots, F_n] = \frac{\prod_i p[F_i/TGT]\, p[TGT]}{\prod_i p[F_i/TGT]\, p[TGT] + \prod_i p[F_i/NT]\, p[NT]} \]

and

\[ p[NT/F_1, \ldots, F_n] = \frac{\prod_i p[F_i/NT]\, p[NT]}{\prod_i p[F_i/TGT]\, p[TGT] + \prod_i p[F_i/NT]\, p[NT]} \]

where the products are understood to range over all features,
i = 1, ..., n.

Now  that  the  desired  a  posteriori  conditionals  have  been 
defined,  it's  necessary  to  define  a  decision  criterion.  The 
criterion  used  by  Roggemann  [10]  is  the  maximum  a  posteriori 
(MAP)  decision  criterion  approximating  the  minimum  probability  of 
error. Simply choose TGT if p[TGT/F] > p[NT/F]; otherwise
select NT [5:1-53].
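
The a posteriori conditionals and the MAP rule can be sketched directly from the equations above. The code is a minimal illustration and is not Roggemann's implementation; the function names and the sample likelihood values are assumptions made for the example.

```python
def posterior_target(lik_tgt, lik_nt, p_tgt=0.5):
    """Return p[TGT/F1..Fn] from per-feature likelihoods.

    lik_tgt[i] is p[Fi/TGT] and lik_nt[i] is p[Fi/NT]. The features are
    treated as conditionally independent given the class, and the default
    prior of 0.5 follows the Principle of Indifference.
    """
    joint_tgt = p_tgt
    for p in lik_tgt:
        joint_tgt *= p
    joint_nt = 1.0 - p_tgt
    for p in lik_nt:
        joint_nt *= p
    evidence = joint_tgt + joint_nt
    if evidence == 0.0:
        raise ValueError("features have zero probability under both classes")
    return joint_tgt / evidence

def map_decision(lik_tgt, lik_nt, p_tgt=0.5):
    """Maximum a posteriori rule: TGT if p[TGT/F] > p[NT/F], otherwise NT."""
    post = posterior_target(lik_tgt, lik_nt, p_tgt)
    return "TGT" if post > 1.0 - post else "NT"

# Hypothetical likelihoods for two features.
post = posterior_target([0.6, 0.5], [0.2, 0.5])
label = map_decision([0.6, 0.5], [0.2, 0.5])
```

With equal priors the prior terms cancel in the comparison, so the decision depends only on which product of likelihoods is larger.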

2.5.2.  Addition  of  Bins 

The  conditional  probability  distribution  function  (PDF)  of 
an  arbitrary  feature  given  a  TGT  (or  NT)  is  in  general  a 
continuous  function.  Therefore,  the  probability  of  a  feature 
lying  on  a  single  feature  point  given  a  TGT  (or  NT)  is  zero.  For 
instance,

\[ p[F = f_0 / TGT] = 0. \]

Therefore, the conditional PDF is broken up into several uniform
incremental regions, called bins. Hence, the conditional
probability actually desired is given by

\[ p[\, f_0 < F < f_1 / TGT \,]. \]

The  number  of  bins  varies  and  allows  the  construction  of  a  new 
discrete  PDF  as  a  function  of  the  number  of  bins.  Figure  2.6 
displays  a  typical  conditional  PDF  (p[F:OBJECT  TRUTH])  as  a 
function  of  the  number  of  feature  bins.  The  (+)  notation  is  the 
conditional  PDF  for  TGT  data,  while  (o)  indicates  NT  data,  where 
OBJECT  TRUTH  represents  TGT  or  NT. 
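
Estimating a binned conditional PDF from training samples amounts to building a normalized histogram. The sketch below assumes uniform bins over a known feature range [lo, hi]; the function name and the sample values are hypothetical and serve only to illustrate the idea.

```python
def binned_pdf(samples, lo, hi, n_bins):
    """Estimate p[bin/class] from feature samples of a single class.

    The range [lo, hi] is split into n_bins uniform bins and the result
    is the fraction of in-range samples that fall into each bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x <= hi:
            # Clamp so that x == hi lands in the last bin.
            index = min(int((x - lo) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]

# Hypothetical feature values measured on target samples.
tgt_pdf = binned_pdf([0.1, 0.1, 0.1, 0.9], 0.0, 1.0, 2)
```

Each class (TGT and NT) would get its own binned estimate, and a new feature value is then scored by looking up the bin it falls in. Fewer bins give smoother but coarser estimates; more bins need more training samples per bin.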


2.6.  Summary

This chapter focused on the background material necessary
for understanding the thesis effort undertaken. The chapter
began with a discussion of imagery preprocessing. It was found
that the images were reduced to a set of features describing and
discriminating the targets of interest. These features represent
the final interpretation of the real world. Next, the ANN
classifier used in this study, the multilayer perceptron, was
introduced. The network required a learning algorithm,
and backprop was chosen using first and second order minimization
techniques. Following a discussion on the performance surface
used for this study, the second order derivation introduced by
David Parker was highlighted. Appendix A discusses the
derivation in more detail. Equation 2.6 represents the second
order approximation to Newton's method. Appendix D discusses
convergence considerations for a true second order Newton's
method. It is in appendix D where it is reasoned that Eq. 2.5
must be approximated. Chapter two concluded with a brief
introduction to the Roggemann implementation of a Bayesian
classifier.

The following chapter discusses in detail the
implementation of the differential equation provided by Eq. 2.8.
The approximations assumed, as well as the efforts to reduce the
computational overhead, will be covered. The latter part of
chapter three will consider the initial network setup.

3.  Second Order Algorithm and Network Convergence

3.1.  Introduction

The first five sections of this chapter are devoted to the
implementation stages of Parker's second order derivation
introduced in chapter two. Specifically, the implementation of
Eq. 2.6 will be discussed. The following sections describe in
detail the approximations made, along with a discussion on the
mathematical notation. In addition, the implementing equations
of section 2.4.3 will be explained in full.

Once the algorithm used in this study has been fully
defined, a discussion on the initial state of the network is
necessary. For example, what are the initial network parameter
values? How are the parameters chosen? These questions will
be addressed in section 3.6.


3.2.  Definition  of  Network  Performance 

For the problem at hand, the parameter to be minimized will
be defined to be the squared error function, where the
instantaneous performance for a single output node is defined to
be

\[ s(\, f_{in}(t),\, w(t) \,) = \left( d_{out}(t) - f_{out}(\, f_{in}(t),\, w(t) \,) \right)^2 = (\, e(t) \,)^2. \]

The performance surface is defined to be a function of the
desired output, d_out(t), and the actual output at time t, and
will be interpreted as the square of the error e(t).

As per Eq. 2.2, the output has the following form,

\[ f_{out} = \frac{1}{1 + \exp\!\left( -( f_{in}^T w + \theta ) \right)}. \tag{3.1} \]

In this context, w pertains to just those weights associated with
the cell in question.

Since there can be many output nodes, there must be an
expression for s to accommodate the vector of error signals
generated from the output layer. Therefore, in general s may be
rewritten as

\[ s = e^T e. \tag{3.2} \]

The dependencies of s on f_in(t) and w(t), and of e(t) on t, have
been suppressed for notational convenience.

3.2.1.  First  Partial  Derivative 

Now that a quantity of performance has been defined, the
first and second partial derivatives of the performance indicator
(s) need to be defined. The first partial derivative of s with
respect to the weights is defined as follows,

\[ \frac{\partial s}{\partial w} = \frac{\partial}{\partial w}\left( e^T e \right) = 2\, \frac{\partial e^T}{\partial w}\, e, \tag{3.3} \]

since the two terms produced by differentiating the product
e^T e are equal. The chain rule was used to obtain Eq. 3.3. The
partial derivative of the transpose of e with respect to w is a
matrix and is defined in appendix B. In this context, w implies
all the weights of the network.
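
Equation 3.3 can be checked numerically for the simplest case of a single sigmoid output cell, where e is a scalar and the partial of e with respect to w is -f_out (1 - f_out) f_in. The sketch below compares the analytic gradient with a central-difference estimate. The function names and test values are assumptions for illustration, and the sigmoid derivative is worked out directly here rather than taken from appendix C.

```python
import math

def cell_output(w, f_in, theta):
    """Sigmoid output of a single cell (Eq. 3.1)."""
    act = sum(wi * xi for wi, xi in zip(w, f_in)) + theta
    return 1.0 / (1.0 + math.exp(-act))

def performance(w, f_in, theta, desired):
    """Instantaneous performance s = e^2 for one output cell."""
    e = desired - cell_output(w, f_in, theta)
    return e * e

def performance_gradient(w, f_in, theta, desired):
    """Analytic ds/dw = 2 * (de/dw) * e, with de/dw = -f(1 - f) * f_in."""
    f = cell_output(w, f_in, theta)
    e = desired - f
    return [2.0 * (-f * (1.0 - f) * xi) * e for xi in f_in]

def numeric_gradient(w, f_in, theta, desired, h=1e-6):
    """Central-difference estimate of ds/dw, used only as a check."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += h
        dn = list(w)
        dn[i] -= h
        grad.append((performance(up, f_in, theta, desired)
                     - performance(dn, f_in, theta, desired)) / (2.0 * h))
    return grad

analytic = performance_gradient([0.3, -0.2], [1.0, 0.5], 0.1, 1.0)
numeric = numeric_gradient([0.3, -0.2], [1.0, 0.5], 0.1, 1.0)
```

Agreement between the two gradients is a useful sanity check before moving on to the second partial derivatives, where sign errors are easy to make.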

3.2.2.  Second  Partial  Derivative 

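The second partial of s with respect to the weights is found
by applying the chain rule once again. Therefore,

\[
\begin{aligned}
\frac{\partial^2 s}{\partial w\, \partial w^T}
 &= \frac{\partial^2}{\partial w\, \partial w^T}\left( e^T e \right)
  = \frac{\partial}{\partial w}\left( \frac{\partial}{\partial w^T}\left( e^T e \right) \right)
  = \frac{\partial}{\partial w}\left( 2\, e^T \frac{\partial e}{\partial w^T} \right) \\
 &= 2 \left( \frac{\partial e^T}{\partial w} \frac{\partial e}{\partial w^T} + e^T \frac{\partial^2 e}{\partial w\, \partial w^T} \right),
\end{aligned}
\]

\[ \frac{\partial^2 s}{\partial w\, \partial w^T} = 2 \left( \frac{\partial e^T}{\partial w} \frac{\partial e}{\partial w^T} + \frac{\partial^2 e^T}{\partial w\, \partial w^T}\, e \right), \tag{3.4} \]

where the partials of e and e^T with respect to w^T and w
respectively, and the second partial of e^T with respect to w and
w^T, are defined in appendix B. Again, in the context above, w is
a vector of weights containing all the weights in the network.

3.2.3.  Approximate Average Second Partial Derivative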

Parker notes that implementing Eq. 2.8 explicitly requires
O(n^2) operations [8]. In order to reduce the number of
operations the network would have to perform to update the
weights, some approximations are in order. The first
approximation concerns the average second partial derivative of
the network performance quantity, s. Basically, to define the
average second partial of the network performance,

\[ \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right), \]

would require an iterative approach to obtain a solution to this
average second derivative matrix. The elements of the matrix
depend on the behavior of the cell at some point in the past,
considering the current weight values. Therefore, they cannot be
computed without going back in time [7:598]. For some
applications, including the application of this effort, this
would require large storage space in the form of memory.
However, Parker suggests an alternative for the class of
semilinear functions [8]. The assumption is to approximate the
average of the second partial of the network performance with the
current instantaneous value, such that

\[ \frac{\partial^2 s}{\partial w\, \partial w^T} \approx \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \tag{3.5} \]

since the network cells have semilinear transfer functions in the
sigmoid function [16:264]. According to Parker this should be a
good estimate [8].


As it turns out, this approximation plays a significant role
in arriving at the implementing equation of section
3.3. If you recall, avg(s) was a function of w(t).
The only dependence of s on time was through w(t). Now that the
instantaneous value is desired, the network performance is a
function of f_in(t) and w(t), where both are functions of time.

The  above  approximation  is  also  a  desired  result.  Recall 
from Eq. 2.4 that the average network performance applied more
weight  to  the  most  current  input.  The  input  data  set  used  for 
this  study  was  a  set  of  feature  vectors  describing  the  target  of 
interest.  There  is  no  reason  to  believe  at  this  point,  that  any 
one  of  these  feature  vectors  is  more  important  than  the  others. 
Therefore,  this  assumption  is  considered  a  good  estimate. 


3.3.  Algorithm  Development 

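The development begins with a restatement of Eq. 2.8, where

\[ \frac{\partial^2 w}{\partial t^2} = -a_1 \frac{\partial s}{\partial w} - \left[ a_2 I + a_3\, \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \right] w - \left[ a_4 I + a_5\, \mathrm{avg}\!\left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \right] \frac{\partial w}{\partial t}. \tag{3.6} \]
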
Using the approximation of Eq. 3.5, Eq. 3.6 becomes

\[ \frac{\partial^2 w}{\partial t^2} = -a_1 \frac{\partial s}{\partial w} - \left[ a_2 I + a_3 \frac{\partial^2 s}{\partial w\, \partial w^T} \right] w - \left[ a_4 I + a_5 \frac{\partial^2 s}{\partial w\, \partial w^T} \right] \frac{\partial w}{\partial t}. \tag{3.7} \]


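As pointed out in section 2.4.1, Parker reasoned that since the
average network performance was a function of t only because the
weights were a function of t, he could make the following
relationship:

\[ \frac{\partial}{\partial t}\left( \frac{\partial s}{\partial w} \right) = \left( \frac{\partial^2 s}{\partial w\, \partial w^T} \right) \frac{\partial w}{\partial t}. \tag{3.8} \]
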

For Eq. 3.7 to be totally correct, this action must be reversed
since it is clear that the instantaneous second partial of s is a
function of w(t). Therefore, substituting Eq. 3.8 in
Eq. 3.7, removing the identity matrix I, and then expanding,
the following equation is obtained:

\[ \frac{\partial^2 w}{\partial t^2} = -a_1 \frac{\partial s}{\partial w} - a_2 w - a_3 \frac{\partial^2 s}{\partial w\, \partial w^T}\, w - a_4 \frac{\partial w}{\partial t} - a_5 \frac{\partial}{\partial t}\left( \frac{\partial s}{\partial w} \right). \tag{3.9} \]

Substitute Eqs. 3.3 and 3.4 into Eq. 3.9, such that

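\[
\begin{aligned}
\frac{\partial^2 w}{\partial t^2} ={}& -2 a_1 \frac{\partial e^T}{\partial w}\, e - a_2 w - a_4 \frac{\partial w}{\partial t} \\
& - 2 a_3 \left( \frac{\partial e^T}{\partial w} \frac{\partial e}{\partial w^T} + \frac{\partial^2 e^T}{\partial w\, \partial w^T}\, e \right) w
 - 2 a_5 \frac{\partial}{\partial t}\left( \frac{\partial e^T}{\partial w}\, e \right)
\end{aligned}
\]

and by applying the chain rule,

\[ \frac{\partial}{\partial t}\left( \frac{\partial e^T}{\partial w}\, e \right) = \frac{\partial}{\partial t}\left( \frac{\partial e^T}{\partial w} \right) e + \frac{\partial e^T}{\partial w} \frac{\partial e}{\partial t} \]

and by regrouping like terms, Eq. 3.9 becomes

\[
\begin{aligned}
\frac{\partial^2 w}{\partial t^2} ={}& -2 a_1 \frac{\partial e^T}{\partial w}\, e - a_2 w - a_4 \frac{\partial w}{\partial t}
 - 2 \frac{\partial e^T}{\partial w} \left( a_3 \frac{\partial e}{\partial w^T}\, w + a_5 \frac{\partial e}{\partial t} \right) \\
& - 2 \left( a_3 \frac{\partial^2 e^T}{\partial w\, \partial w^T}\, w + a_5 \frac{\partial}{\partial t}\left( \frac{\partial e^T}{\partial w} \right) \right) e.
\end{aligned} \tag{3.10}
\]
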
Recall that the error was defined as a vector, since there
may be several error signals to back propagate. This implies
that in the hidden layers, individual cells will be responsible
for summing the input error signals and using this sum for
updating the cell's weights. To clean up Eq. 3.10, the following
quantities are defined:

\[ e_{in} = 2\, e \]

and

\[ e_{tot} = \sum_{i=1}^{r} e_{in,i} = 1^T e_{in}. \tag{3.11} \]

The upper bound, r, is the total number of error signals
propagated to an individual cell. For instance, the upper bound
on the total error (e_tot) propagated to the output layer cells is
r = 1. The number of error signals propagated to the hidden
layer cells depends on the number of cells in the layer
immediately above it. The upper bound will vary from layer to
layer and remain constant within a layer. Rewriting Eq. 3.10
with Eq. 3.11 in mind yields:


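\[
\begin{aligned}
\frac{\partial^2 w}{\partial t^2} ={}& -a_1\, e_{tot} \frac{\partial e^T}{\partial w} - a_2 w - a_4 \frac{\partial w}{\partial t}
 - 2 \frac{\partial e^T}{\partial w} \left( a_3 \frac{\partial e}{\partial w^T}\, w + a_5 \frac{\partial e}{\partial t} \right) \\
& - e_{tot} \left( a_3 \frac{\partial^2 e^T}{\partial w\, \partial w^T}\, w + a_5 \frac{\partial}{\partial t}\left( \frac{\partial e^T}{\partial w} \right) \right).
\end{aligned} \tag{3.12}
\]
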
The time derivatives of e are now considered. From Eq. 3.2
it is clear that the error is a function of the desired and
actual outputs. For the cases studied in this thesis effort, the
desired output is considered to be piece-wise continuous and
constant over the time in question. For any given input vector,
each of the output desired responses is either 0 or 1. Thus, its
time derivative is zero and the time derivative of the error
signal for any single output cell in the network is

\[ \frac{\partial e}{\partial t} = -\frac{\partial f_{out}}{\partial t}. \]

Since f_out is time dependent through f_in(t) and w(t), the time
derivative may be expressed as a function of these time dependent
quantities. Therefore the time derivative of the error signal
of a single cell becomes,

\[ \frac{\partial e}{\partial t} = -\left( \frac{\partial f_{out}}{\partial w^T} \frac{\partial w}{\partial t} + \frac{\partial f_{out}}{\partial f_{in}^T} \frac{\partial f_{in}}{\partial t} \right). \tag{3.13} \]

Also note that

\[ \frac{\partial e}{\partial w^T} = -\frac{\partial f_{out}}{\partial w^T} \tag{3.14} \]

and similarly

\[ \frac{\partial e^T}{\partial w} = -\frac{\partial f_{out}}{\partial w} \tag{3.15} \]

for a single cell. Substituting Eqs. 3.13, 3.14 and 3.15 into
3.12, the weight update rule for a single cell becomes:

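\[
\begin{aligned}
\frac{\partial^2 w}{\partial t^2} ={}& a_1\, e_{tot} \frac{\partial f_{out}}{\partial w} - a_2 w - a_4 \frac{\partial w}{\partial t} \\
& - 2 \frac{\partial f_{out}}{\partial w} \left[ a_3 \frac{\partial f_{out}}{\partial w^T}\, w + a_5 \left( \frac{\partial f_{out}}{\partial w^T} \frac{\partial w}{\partial t} + \frac{\partial f_{out}}{\partial f_{in}^T} \frac{\partial f_{in}}{\partial t} \right) \right] \\
& + e_{tot} \left[ a_3 \frac{\partial^2 f_{out}}{\partial w\, \partial w^T}\, w + a_5 \frac{\partial}{\partial w} \left( \frac{\partial f_{out}}{\partial w^T} \frac{\partial w}{\partial t} + \frac{\partial f_{out}}{\partial f_{in}^T} \frac{\partial f_{in}}{\partial t} \right) \right].
\end{aligned} \tag{3.16}
\]
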

Differentiation of the sigmoid function is a straightforward
exercise and will not be covered here. Appendix C describes the
various  derivatives  of  the  sigmoid  in  detail.  Regrouping  like 
terms  in  Eq.  3.16  results  in 


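\[
\begin{aligned}
\frac{\partial^2 w}{\partial t^2} ={}& a_1\, e_{tot} \frac{\partial f_{out}}{\partial w} - a_2 w - a_4 \frac{\partial w}{\partial t} \\
& - 2 \frac{\partial f_{out}}{\partial w} \left[ \frac{\partial f_{out}}{\partial w^T} \left( a_3 w + a_5 \frac{\partial w}{\partial t} \right) + a_5 \frac{\partial f_{out}}{\partial f_{in}^T} \frac{\partial f_{in}}{\partial t} \right] \\
& + e_{tot} \left[ \frac{\partial^2 f_{out}}{\partial w\, \partial w^T} \left( a_3 w + a_5 \frac{\partial w}{\partial t} \right) + a_5 \frac{\partial^2 f_{out}}{\partial w\, \partial f_{in}^T} \frac{\partial f_{in}}{\partial t} \right].
\end{aligned} \tag{3.17}
\]
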


Since  the  artificial  neural  network  (ANN)  for  this  study 
will  be  run  as  a  computer  simulation,  discrete  mathematics  are 
introduced.  Difference  equations,  in  lieu  of  differential 
equations  are  required.  The  following  set  of  approximations  are 
required  for  the  transformation: 


fl 


A 


i 


4 


i 


M 
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I 

1  a2w  AWk+1  -  AWfc 

-  s  -  :  - 

dt2  At2  At2 

dw  AW]{ 

•*  # 
at  At 


AWk  =  -  Wjj_i. 

w|{  :  a3‘At.Wn  ♦  as'Aw^. 

A^in  ^*in 
— ^— ■  s  » 

At  at 
<’in,  K  -  a5«Af  in.  k* 

At  =  the  amount  of  time  occurring  between  time  step  k  and  k  +  1. 

The quantities $w'_{k}$ and $f'_{in,k}$ may be thought of as average values,
approximating the discrete time derivatives of $w_{k}$ and $f_{in,k}$,
respectively. Recall that the error surface changes
instantaneously with each new input vector. The estimates of the
time derivatives of $w_{k}$ and $f_{in,k}$ provide the network with
information about how the surface is changing with time. These time
derivatives are used in updating the cell weights.


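With these approximations, apply the first two by substitution
and then multiply each side of Eq. 3.17 by $\Delta t^{2}$. Add $\Delta w_{k}$ to
both sides of the equation and then apply the approximate
discrete time derivatives of $w_{k}$ and $f_{in,k}$. The result is

$$\begin{aligned}
\Delta w_{k+1} ={}& (1 - a_4\,\Delta t)\,\Delta w_{k} - a_2\,\Delta t^{2}\,w_{k} + a_1\,\Delta t^{2}\,e_{tot}\,\frac{\partial f_{out}}{\partial w}\bigg|_{k} \\
& - 2\,\Delta t\,\frac{\partial f_{out}}{\partial w}\bigg|_{k}\left( \frac{\partial f_{out}}{\partial w^{T}}\bigg|_{k} w'_{k} + \frac{\partial f_{out}}{\partial f_{in}^{T}}\bigg|_{k} f'_{in,k} \right) \\
& + \Delta t\,e_{tot}\left( \frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}\bigg|_{k} w'_{k} + \frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}\bigg|_{k} f'_{in,k} \right)
\end{aligned} \qquad (3.18)$$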
The final approximation in the implementation stage concerns the
time derivative of $e_{tot}$. Since

$$\frac{de_{tot}}{dt} = -2\,\frac{df_{out}}{dt}$$

for a single output cell, then $e'_{tot}$ may be approximated by

$$e'_{tot} = -2\left( \frac{\partial f_{out}}{\partial w^{T}}\bigg|_{k} w'_{k} + \frac{\partial f_{out}}{\partial f_{in}^{T}}\bigg|_{k} f'_{in,k} \right).$$

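The final weight update equation becomes

$$\begin{aligned}
\Delta w_{k+1} ={}& (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 e_{tot} + e'_{tot})\,\frac{\partial f_{out}}{\partial w}\bigg|_{k} \\
& + e_{tot}\left( \frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}\bigg|_{k} w'_{k} + \frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}\bigg|_{k} f'_{in,k} \right)
\end{aligned} \qquad (3.19)$$

for $\Delta t = 1$, and

$$w_{k+1} = w_{k} + \Delta w_{k+1}.$$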
Further reductions will be made on Eq. 3.19 for programming
convenience in section 3.5. However, Eq. 3.19 represents the
most descriptive form of the generalized second order
approximation to Newton's Method.

3.4.  Generalized  Second  Order  Algorithm 

As alluded to earlier, Parker's second order approximation
is a generalized version of the steepest descent and momentum
methods. From Eq. 3.19, it is readily seen that this equation
contains the steepest descent search algorithm and the momentum
method. The proper selection of learning parameters, $a_1, \ldots, a_5$,
will yield the desired algorithm.

3.4.1.  Steepest  Descent  Algorithm 

If the steepest descent search algorithm is desired, let $a_1$
equal a small positive number in (0, 1) and let $a_4$ equal 1.
Set the other learning parameters equal to 0. With the above
learning parameters, the single cell weight update rule is
reduced to the following:

$$\Delta w_{k+1} = a_1\,\frac{\partial f_{out}}{\partial w}\bigg|_{k}\,e_{tot} = 2\,a_1\,f_{out,k}\,(1 - f_{out,k})\,(d - f_{out,k})\,f_{in,k}.$$

The above equation is equivalent to Lippmann's expression for the
change in weights [4:17], by using the following substitutions:

$$\eta = 2\,a_1, \qquad \delta_{k} = f_{out,k}\,(1 - f_{out,k})\,(d - f_{out,k})$$

such that

$$\Delta w_{k+1} = \eta\,\delta_{k}\,f_{in,k}.$$

Hence, the algorithm of Eq. 3.19 is reduced to a gradient of
steepest descent search algorithm. In addition, one of the terms
of the algorithm has been explained.

3.4.2.  Momentum  Algorithm 

In a similar fashion, the momentum algorithm is obtained.
Set $a_1$ and $a_4$ to small positive numbers in (0, 1) and the
other learning parameters equal to 0. Again Eq. 3.19 is reduced
and the update rule becomes:

$$\Delta w_{k+1} = a_1\,\frac{\partial f_{out}}{\partial w}\bigg|_{k}\,e_{tot} + (1 - a_4)\,\Delta w_{k}.$$

Again, the above equation may be compared to the algorithm
Lippmann introduces [4:17]. Let

$$\alpha = 1 - a_4$$

and again using the substitutions of section 3.4.1,

$$\Delta w_{k+1} = \eta\,\delta_{k}\,f_{in,k} + \alpha\,\Delta w_{k}.$$

Again, Eq. 3.19 is reduced to obtain the momentum method and a
second term of the algorithm has been identified. Appendix D
offers further insight into the momentum term and possible
convergence applications.
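
The two first order special cases can be sketched in a few lines of Python. This is a minimal illustration, not the thesis code (which is written in Ada): the function name `first_order_step` is hypothetical, the threshold is omitted, and the gain is taken as eta = 2*a1 on the assumption that the output-cell error is e_tot = 2(d - f_out).

```python
import math

def first_order_step(w, dw_prev, f_in, d, a1, a4):
    """One single-cell update with a2 = a3 = a5 = 0 and no threshold term.

    a4 = 1 gives steepest descent; 0 < a4 < 1 gives the momentum method.
    """
    activation = sum(wi * xi for wi, xi in zip(w, f_in))
    f_out = 1.0 / (1.0 + math.exp(-activation))
    delta = f_out * (1.0 - f_out) * (d - f_out)   # Lippmann-style delta
    eta = 2.0 * a1                                 # gain (assumed e_tot = 2(d - f_out))
    alpha = 1.0 - a4                               # momentum coefficient
    dw = [eta * delta * xi + alpha * dwi for xi, dwi in zip(f_in, dw_prev)]
    w_next = [wi + dwi for wi, dwi in zip(w, dw)]
    return w_next, dw
```

With a4 = 1 the previous weight change is discarded entirely, which is why the same routine covers both methods.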


3.4.3.  Additive  noise 

The leakage terms introduced by Parker not only produce a
momentum term to enhance convergence but also induce
noise into the algorithm. The terms associated with the learning
parameters $a_2$ and $a_3$ produce the effect of noise. Recall that
$a_3$ was incorporated into the estimated time derivative of $w_{k}$.
With this in mind, $a_2$ and $a_3$ will be set to zero in all
applications of this study.

3.4.4.  Second  Order  Contributions 

All of the terms of Eq. 3.19 have been described with the
exception of those terms associated with $a_5$. The convergence
term $a_5$ controls the amount of change in $\Delta w_{k}$ and
$\Delta f_{in}$. Therefore, $a_5$ also affects the time derivative of
the total error, as seen at the end of section 3.3. Consequently,
contributions from the second order derivatives and time
derivatives are implemented when $a_5$ is activated. If
further insight is required in understanding the various
components of Eq. 3.19, expansion of Eq. 3.7 may help.

3.5.  Final  Implementation  Stage 

Although Eq. 3.19 is the equation for the final weight
update rule, a further reduction is necessary for a computer
programmed implementation. Several temporary network variables
will be defined to help simplify the programming overhead. Each
node in every layer is responsible for performing two passes: a
forward pass and a backward pass. This fact will be used in
associating the temporary variables with the model and further
reducing the programming overhead.

In Eq. 3.19 there are several partial derivatives which must
be computed and reduced to a form acceptable for programming. In
other words, what is the partial of $f_{out}$ with respect to w? What
is the second partial of $f_{out}$ with respect to w and $w^{T}$, and also w
and $f_{in}^{T}$? These partials must be computed in order to simplify
the programming model. Appendix C is devoted solely to the
computations of the various partials of the sigmoid function and
the results will be used here.

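$$\frac{\partial f_{out}}{\partial w}\bigg|_{k} = f_{out,k}\,(1 - f_{out,k})\,f_{in,k}$$

$$\frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}\bigg|_{k} = f_{out,k}\,(1 - f_{out,k})\,(1 - 2 f_{out,k})\,f_{in,k}\,f_{in,k}^{T}$$

$$\frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}\bigg|_{k} = f_{out,k}\,(1 - f_{out,k})\,I + f_{out,k}\,(1 - f_{out,k})\,(1 - 2 f_{out,k})\,f_{in,k}\,w_{k}^{T}$$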
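
The sigmoid partials used here can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the threshold is set to zero, and central finite differences stand in for the analytic work of Appendix C.

```python
import math

def f_out(w, f_in):
    """Sigmoid cell output for weights w and inputs f_in (threshold omitted)."""
    activation = sum(wi * xi for wi, xi in zip(w, f_in))
    return 1.0 / (1.0 + math.exp(-activation))

def grad_w(w, f_in):
    """Analytic first partial: u * f_in, with u = f_out * (1 - f_out)."""
    f = f_out(w, f_in)
    u = f * (1.0 - f)
    return [u * xi for xi in f_in]

def hess_ww(w, f_in):
    """Analytic second partial: u * (1 - 2 f_out) * f_in f_in^T."""
    f = f_out(w, f_in)
    c = f * (1.0 - f) * (1.0 - 2.0 * f)
    return [[c * xi * xj for xj in f_in] for xi in f_in]

def numeric_grad(w, f_in, h=1e-6):
    """Central-difference estimate of the first partial, for comparison."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((f_out(w_plus, f_in) - f_out(w_minus, f_in)) / (2.0 * h))
    return grad
```

Comparing `grad_w` with `numeric_grad` at a few random points is a quick way to catch sign or factor errors before the partials are substituted into the update rule.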


Applying these equations to the algorithm directly would be quite
cumbersome, hence the temporary variables. Before substituting
the above equations into Eq. 3.19, the following temporary
variable is defined. Let

$$u = f_{out,k}\,(1 - f_{out,k})$$

where u is a common factor generated when computing the partial
derivatives of $f_{out}$. Next, substitute the definitions of each
partial, along with u, into Eq. 3.19, such that

$$\begin{aligned}
\Delta w_{k+1} ={}& (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 e_{tot} + e'_{tot})\,u\,f_{in,k} \\
& + e_{tot}\left[ u\,(1 - 2 f_{out,k})\,f_{in,k}\,f_{in,k}^{T}\,w'_{k} + \left( u\,I + u\,(1 - 2 f_{out,k})\,f_{in,k}\,w_{k}^{T} \right) f'_{in,k} \right] \\
={}& (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 e_{tot} + e'_{tot})\,u\,f_{in,k} \\
& + e_{tot}\,u\,(1 - 2 f_{out,k})\,f_{in,k}\left( f_{in,k}^{T}\,w'_{k} + w_{k}^{T}\,f'_{in,k} \right) + e_{tot}\,u\,f'_{in,k}.
\end{aligned}$$

In the above equation, let

$$v = f_{in,k}^{T}\,w'_{k} + w_{k}^{T}\,f'_{in,k}$$

where v represents the time derivative of the product of the
input and its corresponding weight, such that

$$\Delta w_{k+1} = (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 e_{tot} + e'_{tot})\,u\,f_{in,k} + e_{tot}\,u\,(1 - 2 f_{out,k})\,v\,f_{in,k} + e_{tot}\,u\,f'_{in,k}.$$

Next, define

$$q = u\,e_{tot}$$

where q corresponds to the $\delta$ term defined in sections 2.3.5 and 3.4.1, and the
weight update equation becomes:

$$\Delta w_{k+1} = (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 q + u\,e'_{tot})\,f_{in,k} + u\,e_{tot}\,(1 - 2 f_{out,k})\,v\,f_{in,k} + q\,f'_{in,k}.$$

Finally, let

$$r = u\left( e'_{tot} + e_{tot}\,(1 - 2 f_{out,k})\,v \right)$$

and the final weight update equation becomes:

$$\Delta w_{k+1} = (1 - a_4)\,\Delta w_{k} - a_2\,w_{k} + (a_1 q + r)\,f_{in,k} + q\,f'_{in,k}. \qquad (3.20)$$

Equation 3.20 represents the final form. Each temporary variable
must be computed before Eq. 3.20 is evaluated.
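
As a rough illustration of how the temporary variables feed the final update, the following Python sketch evaluates Eq. 3.20 for a single output cell with a time step of 1. It is not the thesis implementation: the names `so_update`, `w_dot` and `f_dot` are hypothetical, the threshold is passed in as a plain number, and the time derivative of the total error is computed as -2*u*v, which assumes an output cell with e_tot = 2(d - f_out).

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def so_update(w, dw, f_in, df_in, e_tot, a, theta=0.0):
    """Return (w_next, dw_next) for one output cell using Eq. 3.20.

    a is the tuple (a1, a2, a3, a4, a5); df_in is the input difference
    f_in,k - f_in,k-1 (all zeros for inputs from the environment).
    """
    a1, a2, a3, a4, a5 = a
    f_out = 1.0 / (1.0 + math.exp(-(dot(f_in, w) + theta)))
    u = f_out * (1.0 - f_out)
    w_dot = [a3 * wi + a5 * dwi for wi, dwi in zip(w, dw)]  # w'_k with dt = 1
    f_dot = [a5 * x for x in df_in]                         # f'_in,k
    v = dot(f_in, w_dot) + dot(w, f_dot)
    e_dot = -2.0 * u * v      # assumed form of e'_tot for a single output cell
    q = u * e_tot
    r = u * (e_dot + e_tot * (1.0 - 2.0 * f_out) * v)
    dw_next = [(1.0 - a4) * dwi - a2 * wi + (a1 * q + r) * x + q * fd
               for wi, dwi, x, fd in zip(w, dw, f_in, f_dot)]
    w_next = [wi + step for wi, step in zip(w, dw_next)]
    return w_next, dw_next
```

Setting a = (a1, 0, 0, 1, 0) makes v and r vanish, so the routine collapses to the steepest descent step of section 3.4.1, which is a convenient sanity check.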

3.5.1  Forward  Pass 

Parker describes the flow of signals in two directions [8].
The input enters the bottom of the network and flows forward with
each node in each layer computing what Parker calls function
signals. Function signals represent the cell outputs and their
respective time derivatives. The cell outputs become the cell
inputs in the layer directly above the cell in question. The
output time derivatives become the input time derivatives in a
similar manner. The output layer produces a function signal
which is immediately compared to some desired response. The
partial derivative of the squared difference between the desired
and actual outputs is termed the error signal. Error signals
are treated in a manner similar to the function signals, but they
are back propagated. In addition, the time derivative of the
error signals is computed from the time derivative of the output
and propagates with the error signal.

On the forward pass each cell in each layer is responsible
for computing and maintaining its own copy of the following:

$$f_{out,k} = f_{out}(f_{in,k}, w_{k}) = \frac{1}{1 + \exp\left( -(f_{in,k}^{T}\,w_{k} + \theta) \right)},$$

$$u = f_{out,k}\,(1 - f_{out,k}),$$

$$w'_{k} = a_3\,w_{k} + a_5\,\Delta w_{k},$$

$$v = f_{in,k}^{T}\,w'_{k} + w_{k}^{T}\,f'_{in,k}.$$

The threshold ($\theta$) is the cell offset and is updated in much the
same way as the weights.

3.5.2.  Backward  Pass 

As mentioned earlier, the cell must also be capable of
handling the backward propagation functions; see Fig. 3.1.

Figure 3.1 Signal Flow Through a Single Cell [8]

Therefore, on the backward pass, the cell must compute the error
signals as well as their time derivatives. Figure 3.2 presents a
simple two layer network to support the discussion below.

Since the algorithm used for this thesis effort is a
supervised backprop, there must be some desired response with
which to compare the actual response. The output of each node
in the output layer will be used to compute an error signal used
for updating the weights. Recall that the function of backprop
is to back propagate the partial derivatives of the quantity to
be minimized. First, the total error and its approximate time
derivative received by each cell are computed, where

$$e_{tot} = \sum_{i} e_{in,i}, \qquad e'_{tot} = \sum_{i} e'_{in,i}.$$

Figure 3.2 Two Layer Network Display



The second step is to compute the output error signal and
its time derivative for propagation to the next lower layer.
The output error is the product of the partial and the
corresponding weight and can be found by

$$e_{out} = e_{tot}\,\frac{\partial f_{out}}{\partial f_{in}} = e_{tot}\,f_{out,k}\,(1 - f_{out,k})\,w_{k} = q\,w_{k},$$

where the results of appendix C and the temporary variables have
been applied. The partial time derivative of $e_{out}$ is found in
much the same way by applying the chain rule, such that

$$\begin{aligned}
e'_{out} &= e'_{tot}\,\frac{\partial f_{out}}{\partial f_{in}} + e_{tot}\left( \frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial w^{T}}\bigg|_{k} w'_{k} + \frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial f_{in}^{T}}\bigg|_{k} f'_{in,k} \right) \\
&= r\,w_{k} + q\,w'_{k}.
\end{aligned}$$

Again, see appendix C and the temporary variable definitions for
clarity.

Once $e_{out}$ and its partial time derivative have been
defined, they can then be related to $e_{in}$ for the cells of the next
lower layer and the process is repeated. The cell's weights may
be updated layer by layer during backprop or the results may be
stored and updated after the backward pass is complete.
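
The backward-pass signals of a single cell can be sketched as follows. This is an illustrative Python fragment, not the Ada implementation: the function name `backward_signals` is hypothetical, and q, r and the weight derivative estimate `w_dot` are assumed to have been computed already from the temporary variable definitions.

```python
def backward_signals(w, w_dot, q, r):
    """Error signal and its time derivative sent to the next lower layer.

    e_out  = q * w            (one component per input line)
    e'_out = r * w + q * w'   (w' is the weight time derivative estimate)
    """
    e_out = [q * wi for wi in w]
    e_out_dot = [r * wi + q * wdi for wi, wdi in zip(w, w_dot)]
    return e_out, e_out_dot


# Example: a cell with two input lines.
signals = backward_signals([0.5, -1.0], [0.1, 0.2], q=0.2, r=-0.3)
```

Each component of the returned vectors belongs to one input line, so a lower-layer cell would collect the component on its own line from every cell above it.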

NOTE: For ease of programming and efficiency of the code, each
layer in the multilayer perceptron will be treated as a record,
since each layer has common attributes. By attributes, it is
implied that each layer will have a vector of outputs and a
matrix of weights. Remember that the inputs to a cell can
originate from the environment or from the other cells in the
layer directly beneath it as outputs from that layer. Therefore,
the inputs will be considered vectors as well. The error signals
associated with each cell will also be defined as a vector
describing the error signals of a given layer.

3.6.  Initial  Network  Conditions

A  great  deal  of  discussion  has  gone  into  the  implementation 
stages  of  the  second  order  back  propagation  algorithm.  However, 
the  questions  that  arise  are:  what  is  the  state  of  the  network 
when  training  is  initialized?  What  initial  values  are  assigned 
to  the  weights,  thresholds  and  input  partial  time  derivatives? 
This  section  addresses  each  of  these  questions. 

In answering the question of the initial values of the
weights and thresholds, it is desired to have the activation level
of each cell roughly equal to 0. Thus,

$$f_{in}^{T}\,w + \theta \approx 0$$

implies that each cell within the network fires at approximately
0.5. The reason this is so important is that if the output cells
fire close to 1 or 0 early in training and the desired
output is 0 or 1 respectively, the network may never recover and
successfully train. Consider the following argument given a
first order minimization technique. The weight update rule for
the gradient of steepest descent has the following form from
section 3.4.1:

$$\Delta w_{k+1} = 2\,a_1\,f_{out,k}\,(1 - f_{out,k})\,(d - f_{out,k})\,f_{in,k}.$$

Suppose the output from a given cell in the output layer is ~1
and the desired output is equal to 0. The desired minus the output is
~-1, but the other difference term, $(1 - f_{out,k})$, is ~0. Therefore, under these
conditions little or no change occurs in the weights. However,
if each cell in the network is initially firing near 0.5, the
network is provided the opportunity to learn.
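
A short numeric check makes the saturation argument concrete. The sketch below is illustrative only; `step_factor` is a hypothetical helper that evaluates the output-dependent factor of the steepest descent update, leaving out the gain and the input.

```python
def step_factor(f_out, d):
    """Output-dependent factor f_out * (1 - f_out) * (d - f_out) of the update."""
    return f_out * (1.0 - f_out) * (d - f_out)


saturated = step_factor(0.99, 0.0)   # cell fires near 1, desired output is 0
centered = step_factor(0.5, 0.0)     # cell fires near 0.5, desired output is 0
ratio = centered / saturated         # how much larger the centered step is
```

For these values the centered cell takes a step more than ten times larger than the saturated one, even though the saturated cell is the one that is badly wrong.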

More information can be provided by examining the second
partial of the error with respect to w and $w^{T}$. The underlying idea is to
examine the sign definiteness of this matrix over the entire
ensemble of training vectors. If it is possible to show that the
average second partial matrix is positive definite over the
entire input ensemble, then this would imply a surface with
upward concavity. This further implies a global minimum over the
entire ensemble of input vectors considered. Appendix D
considers this for a single cell and establishes a criterion for
initializing training in a neighborhood of the global minimum for
an arbitrary training set. The result is provided here:

$$-\ln(2) < \sum_{i=1}^{n} w_{i}\,f_{in,i} < \ln(2). \qquad (3.21)$$

It is implied in the above inequality that one of the inputs is
equal to 1, corresponding to the so-called threshold.

Information about the input may reduce the above inequality to
strictly a function of the weights. For instance, the FLIR
feature vectors were normalized according to Eq. 2.1. After
normalization, the features ranged over roughly (-1.5, 1.5). Selecting
a worst case scenario where all the features of a given vector
equal 1.5 (or -1.5), the above criterion reduces to:

$$-0.462 < \sum_{i=1}^{n} w_{i} < 0.462.$$

The inequality of Eq. 3.21 provides a measure as to
the initial weight settings for a given cell. For instance, the
weights could be set randomly and uniformly between (-t, t),
where t is a small floating point number, if some information is
known about the input. A test could then be performed to ensure
that the above inequality is met. Randomly setting the weights
uniformly between (-0.45, 0.45) for the input data in this study
satisfied the criterion provided by Eq. 3.21.
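
The worst case bound follows from dividing ln(2) by the largest feature magnitude, 0.693/1.5 = 0.462. A minimal Python sketch of the initialize-then-test procedure is given below; the function names are hypothetical, and the bound is taken to be ln(2) on the strength of that arithmetic.

```python
import math
import random

BOUND = math.log(2.0)            # right-hand side of the criterion
WORST_CASE_LIMIT = BOUND / 1.5   # bound on the weight sum when every feature is 1.5

def init_weights(n, t, rng):
    """Draw n weights uniformly from (-t, t)."""
    return [rng.uniform(-t, t) for _ in range(n)]

def satisfies_criterion(w, f_in):
    """True when the weighted input sum lies strictly inside (-ln 2, ln 2)."""
    activation = sum(wi * xi for wi, xi in zip(w, f_in))
    return -BOUND < activation < BOUND

def init_until_valid(n, t, training_inputs, rng, max_tries=1000):
    """Redraw the weights until every training vector meets the criterion."""
    for _ in range(max_tries):
        w = init_weights(n, t, rng)
        if all(satisfies_criterion(w, x) for x in training_inputs):
            return w
    raise RuntimeError("no weight vector met the criterion; try a smaller t")
```

Passing a seeded `random.Random` instance keeps the initialization repeatable across training runs, which helps when comparing methods.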

The final question to be answered considers the time
derivatives of the input from the environment. No information
was provided concerning the time derivatives of the input from
the environment. Since the algorithm is a discrete version of
the second order linear differential equation, it will be assumed
that each input will be constant over the period of time ($\Delta t$)
in question. Therefore, it is assumed that the network input
time derivatives are equal to 0. It should not be assumed,
however, that the input time derivatives of the hidden layer nodes are zero.
Recall that the output of each layer becomes the input of the
cells in the layer above. It is understood that the output is
changing with time, as per section 3.4.

3.7.  Summary 

The first several sections of this chapter are devoted to the
implementation of Parker's linear differential equation in Eq.
2.8. A great deal of text was devoted to the implementation
stages to achieve a better understanding of the concepts hidden
within the mathematics. Time derivatives of the signals were
derived in order to inform the network about how the performance
surface is changing with time. With the information in this
chapter, along with appendices A, B, C, and D, most (if not all)
of the concepts have surfaced. Once all of the terms of the
final implementing equation had been derived, it was necessary to
introduce the temporary variables to ease the programming effort.
Finally, initial network conditions were considered.

The following chapter provides the results of the validation
stage of the algorithm discussed in section 3.3. Chapter four
begins by using the algorithm derived in this chapter to solve
the exclusive OR problem. The latter sections test the algorithm
on the doppler imagery which has already been classified by Ruck,
to further validate the algorithm.
4.  Validation  of  Second  Order  Algorithm


4.1.  Introduction 

In the last chapter, an extensive analysis of the second
order (SO) algorithm implementation was presented. This
chapter focuses on the validation stage of the
classifier applying this new algorithm. The validation stage
will consist of two parts. The first involves application of the
SO back propagation algorithm in a network used to solve the
exclusive OR (XOR) problem. The second stage initiates the quest
of pattern classification beginning with a set of feature vectors
generated from doppler imagery. Ruck used these same features
with moderate success [12]. He was able to attain near perfect
classification with the training data and roughly 75%
classification of the test data [12]. These facts add validity
to the input feature vectors. The generalized second order
algorithm developed in this study will allow the comparison
between first and second order techniques.

The next section begins with a description of the XOR
problem set up. The input and learning parameters used are
provided in this section, as well as the convergence results.
Section 4.3 begins the pattern classification effort for this
study. Within this section, the input feature vectors are
described and formatted. In addition, the network architecture
and learning parameters are described for the gradient of
steepest descent, momentum, and second order methods. The
results of the pattern classification of the doppler imagery
feature vectors follow.

4.2.  The  Exclusive  OR  Problem 

The results of this section are basically a reproduction of
the results generated by Parker [8]. Since the algorithm used
for this study generalizes to the first order methods of
steepest descent and momentum, the problem will be attempted using
both second and first order techniques. The results will be
presented in a table based on the number of iterations until
convergence.

4.2.1  Input  Data  and  Network  Parameters 

The idea is to train the network on a fixed set of inputs,
which are listed in Table 4.1 along with the desired responses.
The inputs will be shown to the network, as in Fig. 4.1,
iteratively until the desired outputs are obtained. The output
will be measured indirectly by monitoring the error. The error
is defined as the difference between the desired output and the
actual output. Therefore, the criterion used in validating the
model will be the error. By minimizing the error, the squared
error is minimized as well.


Table 4.1 Input Pattern Vectors and Desired Response for XOR

Vector    Input 1    Input 2    Desired Response
  1         0.1        0.1            0.1
  2         0.9        0.1            0.9
  3         0.1        0.9            0.9
  4         0.9        0.9            0.1


Figure  4.1  XOR  Network  Architecture 


The initial weight values were set to small random numbers
satisfying the criterion provided by Eq. 3.21. The contents of
Table 4.2 list the values of the learning parameters used in
solving this problem.

Table 4.2 Learning Parameter Values

Method          a1            a2     a3     a4            a5
Gradient        0.1           0.0    0.0    1.0           0.0
Momentum        0.1           0.0    0.0    0.1           0.0
Second Order    [illegible]   0.0    0.0    [illegible]   0.05

An extensive search of the optimum learning parameters was
not performed. It was only desired to prove that the network and
its SO back propagation algorithm could solve the XOR problem.
By solving the problem, it is implied that the network found the
optimum path for the weights to follow. When the error is
reduced to some predetermined criterion, the network concludes
its training, and the optimum weight values are obtained. The
error criterion, or the difference between the desired and actual
output, was set to 0.1. The criterion must be met or surpassed on
four successive iterations, allowing the network the opportunity
to classify all four inputs and meet the criterion.
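
The stopping rule described above can be sketched as a small check on the most recent errors. This Python fragment is illustrative only: the function name `converged` is hypothetical, and "met or surpassed" is read as an absolute error no larger than the criterion.

```python
def converged(recent_errors, criterion=0.1, window=4):
    """True when the last `window` errors all have magnitude <= criterion.

    recent_errors holds (desired - actual) for successive iterations, one
    input pattern per iteration, so window=4 covers all four XOR patterns.
    """
    if len(recent_errors) < window:
        return False
    return all(abs(e) <= criterion for e in recent_errors[-window:])
```

Because the four XOR patterns are presented in turn, requiring four successive passes of the test means that every pattern has met the criterion with the same set of weights, up to the small updates made in between.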

4.2.2  Convergence  Results 

The results tabulated in Table 4.3 show that the network
weights found the optimum path for convergence. This implies
that the network can perform the XOR logic function. The
average number of iterations was computed from 20 test runs.


Table 4.3 Comparison Between First and Second Order Techniques

Method          Average Number of Iterations
Gradient        >20,000
Momentum        5474
Second Order    5054


The results show that, on average, the SO method slightly
outperformed the momentum method. In addition, both methods greatly
exceeded the performance of the gradient of steepest descent
method. The gradient of steepest descent method was extremely
slow in learning and was terminated after 20,000 iterations, with the
error still slowly decreasing.

The results appear to be very promising for extending the
application of the SO approximation to more difficult problems.
The following section addresses such a problem. The fact that
the SO approximation method exceeded the performance of the first
order methods in no way guarantees that it will do so on more
difficult problems, particularly in pattern classification, where the
inputs may be great in number. Other considerations along this
same line are the number of layers required, the number of nodes
and ultimately the number of weights required to solve the
problem of machine recognition of images.

The Ada code used in implementing the XOR
algorithm is found in appendix F.


4.3.  Classification  of  Doppler  Imagery 

In  the  last  section  it  was  shown  that  the  SO  approximation 
method  could  in  fact  be  used  in  solving  the  XOR  problem.  The 
results  represent  a  promising  indication  that  the  applications 
may  be  extended  using  this  method.  Therefore,  in  this  section, 
features  extracted  from  doppler  imagery  will  be  used  as  the  input 
to  a  multilayer  perceptron.  The  multilayer  perceptron  will  apply 
the  SO  back  propagation  method,  as  well  as  the  first  order 
methods  for  comparison. 

The  following  subsection  describes  the  features  in  a  little 
more  detail.  Next,  the  specifics  of  the  network  architecture  are 
discussed,  followed  by  some  comments  on  the  values  used  for  the 
learning  parameters.  The  final  subsection  discusses  the  results 
and  makes  a  comparison  between  the  first  and  second  order  back 
propagation  techniques. 

4.3.1  Input  Feature  Data 

The features extracted from the doppler imagery consisted of
normalized moment invariants. To the network, the features were
actually a set of vector components of normalized moments. Each
vector, or example of a target, consisted of 22 features and is, in
general, the final version of a machine's representation of an
object in an image. The targets to be classified included tanks
at four different aspect angles, jeeps, 2.5 ton trucks and
petroleum, oil, and lubricant (POL) tankers. The data base of
target feature vectors heavily favored the tanks. Table 4.4
breaks down the number of targets considered for each class.


Table 4.4 Target Data Base for Classification

Class    # Training Samples    # Test Samples
Tank            43                   17
POL              4                    2
Jeep             6                    3
Truck            4                    2


Roughly  two  thirds  of  the  available  feature  vectors  were  used  in 
training  the  network,  while  the  remaining  vectors  were  used  for 
testing  the  network  once  trained. 

4.3.2.  Network  Architecture  and  Learning  Parameters

The multilayer perceptron will consist of three layers,
which will accept 22 inputs and output 4 classes. Table 4.5
describes the network architecture in some detail. Table 4.6
provides the learning parameters used for classifying the doppler
imagery feature vectors. Several different combinations of
learning parameters were used for training. This search was not
exhaustive, since there is an enormous number of these
combinations. However, the parameters listed in Table 4.6
provided the best combination, of those tried, as far as
classification accuracy and error performance were concerned.

Table 4.5 Network Architecture Data

Number of Features    22
Layer One Nodes       20
Layer Two Nodes        6
Number of Classes      4

Table 4.6 Network Training Data

Parameter               Gradient Method    Momentum Method    SO Method
a1                           0.3                0.3              0.3
a2                           0.0                0.0              0.0
a3                           0.0                0.0              0.0
a4                           1.0                0.1              0.1
a5                           0.0                0.0              0.1
Number of Iterations       60,000             60,000           60,000
Data Output Interval        2,000              2,000            2,000

The text to follow provides the results generated from
target classification of the doppler imagery. Average
classification accuracy and the average total output error are
provided, along with the network performance on individual
classes.
4.3.3.  Classification  Results

The graphs depicted below are the results from the
classification of the feature vectors generated from doppler
imagery. The results are displayed in terms of average
classification accuracy versus the number of iterations for both
the first and second order methods. The log of the average total
output error versus the log of the number of iterations was
measured as well. In addition, a typical instance of
classification accuracy for each class is listed below. An
instance implies that the data was not averaged. The graphs
below are presented in order of test results rather than by
method, for ease of comparing each method.

4.3.3.1  Average  Classification  Accuracy 

The average classification accuracy was taken from 10
complete passes through the network, since it was desired to
obtain an average network performance. Given the randomness of
the initial state of the network, the network does not perform in
exactly the same way with each training attempt. Each pass
through the network re-initializes the network parameters and
begins training all over again. The specific characteristics
desired for comparison were the convergence rate and stability of
each method.

The criterion used in determining a correct response, and
thus the accuracy of the classifier, was based on the actual
output values of the nodes in the output layer. The desired node
output for a correct classification is 1, while all the other
node outputs are 0. Therefore, the criterion requires the
desired node to fire at 0.6 or above, while the
other nodes fire at 0.2 or less.
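
The correctness criterion can be expressed as a short test on the output vector. The sketch below is illustrative: the function names are hypothetical, and the 0.6 and 0.2 thresholds are taken as they appear in the text.

```python
def is_correct(outputs, target, fire_level=0.6, quiet_level=0.2):
    """True when the target node fires at or above fire_level and every
    other output node is at or below quiet_level."""
    return all(
        (o >= fire_level) if i == target else (o <= quiet_level)
        for i, o in enumerate(outputs)
    )

def accuracy(output_vectors, targets):
    """Fraction of vectors classified correctly under the criterion."""
    hits = sum(is_correct(o, t) for o, t in zip(output_vectors, targets))
    return hits / len(targets)
```

Note that a vector whose largest output belongs to the right class still counts as an error if any other node rises above the quiet level, so this measure is stricter than a simple winner-take-all rule.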

Figures  4.2,  4.3  and  4.4  display  the  average  training 

accuracy  of  the  gradient  of  steepest  descent,  momentum  and  second 
order  methods  respectively.  Figures  4.5,  4.6  and  4.7  display  the 
average  test  accuracy. 


[Plot: average network training performance for gradient method; accuracy versus number of iterations]

Figure 4.2 The network achieved roughly 92% accuracy on training
data.

Figure 4.4 The network achieved 90% accuracy on training data.

Figure 4.5 The network achieved 75% accuracy on the test data.

Figure 4.6 The network achieved 78% accuracy on the test data.

[Plot: average network testing performance for second order method; accuracy versus number of iterations]

Figure 4.7 The network achieved 7[digit illegible]% accuracy on the test data.

The momentum and second order methods slightly exceeded the
performance of the gradient of steepest descent method on the
test and training data. On the average, there was little
difference between the momentum and second order methods.
However, close examination of the average classification accuracy
reveals that the second order method initially converges slightly
more quickly than the momentum method. Over the last 10,000
iterations the momentum method seemed to settle down and provide
a consistent accuracy. On the other hand, the second order
method continued to climb, but in a slightly erratic manner.

4.3.3.2  Average  Total  Output  Error 

This section presents the results of the average total output error
of the network. The error was defined as the magnitude of the
difference between the desired output and the actual output. The
total output error was merely the sum of the errors of all of the
output nodes. The total output error was averaged over the entire
set of training (or testing) input feature vectors, and ultimately
over each pass through the network. The log of the average error was
graphed versus the log of the number of iterations. It was desired
to determine what trends, if any, the average error displayed for
both the training and test data sets. Again, Figs. 4.8, 4.9 and 4.10
reflect the training results of the gradient, momentum and SO
methods, respectively. Figures 4.11, 4.12 and 4.13 display the
results of the test data.
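
The error measure defined in this section can be sketched as follows.
This is only an illustration under stated assumptions: the function
names and sample values are hypothetical, and a base-10 logarithm is
assumed because the text does not state which base was plotted.

```python
import math


def total_output_error(desired, actual):
    """Sum over output nodes of |desired - actual| for one feature vector."""
    return sum(abs(d - a) for d, a in zip(desired, actual))


def average_total_output_error(desired_set, actual_set):
    """Mean of the total output error over a set of feature vectors."""
    errors = [total_output_error(d, a) for d, a in zip(desired_set, actual_set)]
    return sum(errors) / len(errors)


# Hypothetical desired and actual outputs for two four-class feature vectors.
desired_set = [[1, 0, 0, 0], [0, 1, 0, 0]]
actual_set = [[0.75, 0.25, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

avg_error = average_total_output_error(desired_set, actual_set)
log_avg_error = math.log10(avg_error)  # assumed base 10
```

The same average would then be taken again over the passes through
the network before plotting.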


Figure 4.8  Average network training performance for gradient method
(log(error) versus log(number of iterations)). Note the smooth
(almost monotonic) decreasing error.

Figure 4.9  Average network training performance for momentum method
(log(error) versus log(number of iterations)). The error initially
drops off smoothly, then becomes erratic as training continues.

Figure 4.10  Average network training performance for second order
method (log(error) versus log(number of iterations)). The error
initially drops off smoothly, then becomes erratic as training
continues.

Figure 4.11  Average network testing performance for gradient method
(log(error) versus log(number of iterations)). Note the initial
smooth descent. In addition, notice the relatively flat region in
the middle and then a slightly erratic descent over the last 30,000
iterations.

Figure 4.12  Average network testing performance for momentum method
(log(error) versus log(number of iterations)). The initial error
drops much more quickly than with the gradient method and becomes
erratic as training continues.

Figure 4.13  Average network testing performance for second order
method (log(error) versus log(number of iterations)). The initial
error drops much more quickly than with the gradient method and
becomes erratic as training continues.

Again, the momentum and second order methods slightly exceed the
average error performance of the gradient of steepest descent
method. In addition, the momentum and second order methods for the
most part performed on a comparable basis. Over the first half of
training, the error decreases rather smoothly, after which the error
behavior of the momentum and second order methods becomes erratic.
This may be explained by the following argument. If the minimum of
the error surface lies in a relatively flat hyperplane, then there
could be many solutions (optimum weight values) for convergence.
Since the instantaneous error surface is changing with each input,
the network is simply trying to find an exact solution. This may be
interpreted as the network attempting to memorize the input data.
The classification results of the FLIR imagery in the following
chapter further support this idea. The gradient of steepest descent
method has a little more stability over the last half of training,
as shown in Figs. 4.8 and 4.11. The reason for this is that the
weights are being updated by a small constant proportion of the
partial derivative. This is a slow, gradual process and the network
has not quite converged to the minimum; it is still learning.
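
The difference between the plain gradient update and the momentum
update can be sketched as follows. This is a textbook form given
under stated assumptions, not the thesis code: the names eta
(learning rate) and alpha (momentum factor) are the usual textbook
symbols and may not match the thesis parameter names, the quadratic
error surface is a toy example, and the second order method is
deliberately left out because its update is not restated in this
section.

```python
def gradient_step(weights, grads, eta):
    """Steepest descent: move each weight a fixed fraction eta down its gradient."""
    return [w - eta * g for w, g in zip(weights, grads)]


def momentum_step(weights, grads, prev_deltas, eta, alpha):
    """Steepest descent plus a fraction alpha of the previous weight change."""
    deltas = [-eta * g + alpha * d for g, d in zip(grads, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Toy error surface E(w) = 0.5 * w**2, so dE/dw = w (illustration only).
w_plain = [1.0]
w_mom, delta = [1.0], [0.0]
for _ in range(20):
    w_plain = gradient_step(w_plain, w_plain, eta=0.1)
    w_mom, delta = momentum_step(w_mom, w_mom, delta, eta=0.1, alpha=0.5)
```

On this toy surface the momentum run ends closer to the minimum after
the same number of steps, while the plain gradient run approaches it
slowly and smoothly, which mirrors the qualitative behavior described
above.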


4.3.3.3  Target Accuracy


The data in this section reflect a typical instance of the actual
target accuracy provided by all three methods. Again, training and
test data results are displayed in the tables below. The data
gathered and displayed in the tables below were actually generated
from the network output. The tables are presented in the form of a
Network Confusion Matrix. Not only was it desired to determine the
actual target accuracy, but this data may provide some insight as to
the worth of the feature vectors in discriminating the various
targets.


Tables 4.7, 4.8, and 4.9 list the results from the training data,
while Tables 4.10, 4.11, and 4.12 display the results from the test
data.

Table 4.7  Training Data Confusion Matrix for Gradient Method.
Reads row by row left to right. The network failed to classify
the following number of feature vectors: 2 tanks and 1 each POL,
jeep, and truck.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      41     0     0       0     95.3%
POL        0     3     0       0     75%
Jeep       0     0     5       0     83.3%
Truck      0     0     0       3     75%

Table 4.8  Training Data Confusion Matrix for Momentum Method.
Reads row by row left to right. The network failed to classify 1
POL feature vector.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      43     0     0       0     100%
POL        0     3     0       0     75%
Jeep       0     0     6       0     100%
Truck      0     0     0       4     100%

Table 4.9  Training Data Confusion Matrix for Second Order Method.
Reads row by row left to right. The network failed to classify 1
POL feature vector.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      43     0     0       0     100%
POL        0     3     0       0     75%
Jeep       0     0     6       0     100%
Truck      0     0     0       4     100%


The training results in all the tests continue to support the idea
that if a network is trained long enough, it will learn, or at the
very least memorize, the input training set.


Table 4.10  Test Data Confusion Matrix for Gradient Method. Reads
row by row left to right. Notice that two jeep feature vectors
were classified as a tank, while the other was classified as a
truck.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      16     0     0       0     94.1%
POL        0     0     0       0     0%
Jeep       2     0     0       1     0%
Truck      0     0     0       1     50%

Table 4.11  Test Data Confusion Matrix for Momentum Method.
Reads row by row left to right. Notice that two jeep feature
vectors were classified as a tank, while the other was classified
as a truck.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      17     0     0       0     100%
POL        0     0     0       0     0%
Jeep       2     0     0       1     0%
Truck      0     0     0       2     100%

Table 4.12  Test Data Confusion Matrix for Second Order Method.
Reads row by row left to right. Notice that two jeep feature
vectors were classified as a tank, while the other was classified
as a truck.

Class    Tank   POL   Jeep   Truck   Accuracy
Tank      17     0     0       0     100%
POL        0     0     0       0     0%
Jeep       2     0     0       1     0%
Truck      0     0     0       2     100%
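
The accuracy column of these confusion matrices can be reproduced
with a short sketch. It is an illustration under stated assumptions:
the function name per_class_accuracy is hypothetical, and the class
sizes are inferred from Table 4.7 and its caption (each row sum plus
the vectors the network failed to classify), which is why they are
passed in separately rather than taken from the row sums.

```python
def per_class_accuracy(confusion, class_totals):
    """Percent of each class's vectors that landed on the matrix diagonal."""
    return [
        100.0 * confusion[i][i] / total if total else 0.0
        for i, total in enumerate(class_totals)
    ]


# Table 4.7 (gradient method, training data); rows/cols: Tank, POL, Jeep, Truck.
table_4_7 = [
    [41, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 3],
]
# Assumed class sizes: row sum plus unclassified vectors from the caption.
class_totals = [43, 4, 6, 4]

accuracies = per_class_accuracy(table_4_7, class_totals)
```

Dividing by the class size instead of the row sum matters here,
because unclassified vectors appear in no column yet still count
against the class accuracy.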

It should not be too surprising to observe that the second order
and momentum methods perform slightly better than the gradient of
steepest descent method, given the earlier results. The gradient of
steepest descent method will eventually reach the performance levels
of the other two methods given more training iterations.

The  Confusion  Matrices  above  reinforce  the  results  provided 
by  Ruck  [12].  The  results,  in  all  of  the  Test  Data  Confusion 
Matrices,  confirm  Ruck's  original  hypothesis  [12].  The  small 
number  of  training  features  (other  than  tanks)  did  not  provide 
enough  information  for  the  network  to  properly  segment  the  input 
decision  space.  However,  even  though  the  second  order  algorithm 
performed  as  well  as  the  momentum  method,  it  did  not  provide  any 
improvements. 

4.4.  Summary 

The first stage of validation was to show that the SO algorithm
proved successful in solving the XOR problem. The proof basically
duplicated, as well as verified, the results found by Parker [8].
The SO approximation not only solved the XOR problem, but provided
faster convergence on the average. The ability of the SO algorithm
to classify feature vectors generated from Doppler imagery hinted
that the algorithm could be used on other types of classification
features. It was found that the momentum and second order methods
slightly exceeded the performance of the gradient of steepest
descent method. Furthermore, the second order and momentum methods
performed on a comparable level.

A discussion of the general results of this chapter, and those in
chapter 5, will be entertained in chapter 6, within the discussions
section. The next chapter is devoted to the classification of
features generated from Forward Looking Infrared Imagery.

Classification Results of Forward Looking Infrared Imagery

5.1.  Introduction

The  results  of  chapter  4  conclude  that  the  second  order 
minimization  technique  proved  to  be  relatively  successful.  This 
chapter  concerns  various  classifications  of  features  generated 
from  forward  looking  infrared  (FLIR)  imagery.  As  mentioned 
earlier  in  chapter  two,  other  types  of  features,  as  well  as 
moment  invariants,  will  be  considered  for  classification. 

The  next  section  of  this  chapter  deals  directly  with  the 
classification  of  those  features  generated  for  comparison  with 
the  Bayesian  classifier.  This  classification  effort  will  be 
based  on  target  (TGT)  and  non-target  (NT)  recognition.  The 
features  selected  for  classification  were  the  normalized  versions 
of  the  blob  length  to  width  ratio,  blob  relative  mean  intensity, 
and  blob  standard  deviation  of  the  intensity  (section  2.E. 2). 

The  following  section  concerns  the  classification  of  the 
moment  invariant  feature  vectors.  The  same  comparisons  drawn  for 
the  doppler  imagery  in  chapter  4,  will  be  used  again  in  this 
section  for  the  FLIR  imagery. 

5.2.  Target and Non-Target Feature Classification

There were many feature vectors available for classification and
approximately 75% of each class was used for training. Of the 819
feature vectors available for input, 615 were used for training;
the others made up the testing data base.


5.2.1.  Input Feature Data

Objects making up the TGT class consisted of tanks (TA), trucks
(TR), APCs (AP), and jeeps (CJ). There were also several features
generated from the combination of a tank and jeep (TC). The two
targets were too close together to be resolved by the segmentation
process. Table 5.1 breaks down the number of samples provided by
each TGT, as well as providing the number of NT samples, for both
training and testing.

Table 5.1  TGT and NT Sample Breakdown

Class        # Training Samples   # Testing Samples
TA                   60                   17
TR                   80                   25
AP                   85                   28
CJ                   25                   10
TC                   15                    6
TGT Total           265                   88
NT                  350                  116

The raw features generated from the FLIR imagery consisted of a
wide range of values. In order to prevent the larger valued features
from biasing the network, a normalization scheme was required. An
attempt at a linear normalization scheme, placing all the data
within the unit hypercube, failed to produce the desired results.
The network failed miserably when attempting to train on, and
classify, the feature vectors. This failure prompted an attempt at
another normalization scheme. The training feature vectors were
normalized to a 0 mean vector and a standard deviation vector of 1,
as described in section 2.5.1. This normalization scheme proved to
be much more successful than the previous scheme and the results
are provided below in section 5.2.3.
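
The two normalization schemes contrasted above can be sketched as
follows. This is a minimal illustration under stated assumptions, not
the thesis implementation: the function names and sample data are
hypothetical, the statistics are computed from whatever set is passed
in, and the population standard deviation is assumed because the text
does not say which estimator was used.

```python
import statistics


def minmax_normalize(vectors):
    """Linearly rescale each feature to [0, 1] (the unit hypercube)."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    # Guard against constant features, whose range would be zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(x - lo) / s for x, lo, s in zip(v, lows, spans)] for v in vectors]


def zscore_normalize(vectors):
    """Shift each feature to zero mean and scale it to unit standard deviation."""
    columns = list(zip(*vectors))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]


# Hypothetical raw features with very different ranges (illustration only).
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
unit_cube = minmax_normalize(raw)
standardized = zscore_normalize(raw)
```

Both schemes remove the scale difference between features; they
differ in that the second centers the data on the origin, so feature
vectors can point in opposite directions rather than all lying in the
positive orthant.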

The notion that the first normalization scheme failed and the
second was somewhat successful may provide some insight as to the
function of neural net classifiers. It appears that the network not
only cares about the magnitude of each vector, but also cares a
great deal about the angle between vectors. It appears from this
argument that the network is functioning as a nearest neighbor
classifier.

5.2.2.  Network  Architecture  and  Learning  Parameters 

The  network  architecture  used  for  classifying  this  set  of 
feature  vectors  is  described  in  Table  5.2.  Table  5.3  describes 
the  network  training  data  for  all  three  minimization  techniques. 


Table  5.2  Network  Architecture  Data 


Table 5.3  Network Training Data

Parameter              Gradient Method   Momentum Method   SO Method
a1                          0.25              0.25            0.25
a2                          0.0               0.0             0.0
a3                          0.0               0.0             0.0
a4                          1.0               0.3             0.3
a5                          0.0               0.0             0.1
Number of Iterations      200,000           200,000         200,000
Data Output Interval        4,000             4,000           4,000

The  weights  and  thresholds  were  initialized  to  values  within  the 
interval  (-0.45,  0.45)  using  a  uniform  random  number  generator. 
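
A uniform initialization of this kind can be sketched as follows.
This is an illustration only: the function name init_layer and the
layer sizes are hypothetical, and Python's random.uniform may return
an endpoint value, whereas the text specifies an open interval.

```python
import random


def init_layer(n_inputs, n_nodes, limit=0.45, rng=random):
    """One layer's weights and thresholds drawn uniformly from [-limit, limit]."""
    weights = [[rng.uniform(-limit, limit) for _ in range(n_inputs)]
               for _ in range(n_nodes)]
    thresholds = [rng.uniform(-limit, limit) for _ in range(n_nodes)]
    return weights, thresholds


rng = random.Random(0)  # fixed seed only so the illustration is repeatable
layer_sizes = [3, 5, 2]  # hypothetical: 3 features, 5 hidden nodes, 2 classes
layers = [init_layer(n_in, n_out, rng=rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
```

Re-running the initialization with a different seed corresponds to
one new pass through the network in the sense used in chapter 4.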

5.2.3.  Classification  Results 

This portion of the text provides the classification results of
the neural net classifier. The instantaneous classification
accuracy versus the number of iterations is provided below. Due to
the large number of input training vectors and the enormous number
of training iterations required, the results were not averaged. The
log error is also considered, along with the tally on the TGT and NT
individual accuracies. Again, comparisons between the different
methods will be drawn for both the training data and testing data.
In addition, the performance results of the Bayesian classifier are
presented in this section as well.


5.2.3.1  Instantaneous  Classification  Accuracy 

For the neural net classifier, the same criterion as the one used
in section 4.4.3.1 is used again here. The desired output node must
fire greater than or equal to 0.8, while the remaining nodes fire at
0.2 or less. By using this criterion, the neural network classifier
is at somewhat of a disadvantage when compared to the Bayesian
classifier. Recall that the Bayesian classifier uses the maximum a
posteriori decision criterion. This is a more lenient criterion
than the one placed on the neural network classifier. Appendix E
considers a comparison between the two classifiers, given a more
lenient criterion placed on the neural net classifier.
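
The effect of the two kinds of criterion can be sketched as follows.
This is an illustration under stated assumptions: the function names
and sample outputs are hypothetical, and the winner-take-all rule is
only a stand-in for a more lenient criterion, since the exact rule
used in Appendix E is not restated in this section.

```python
def strict_hit(outputs, desired, fire=0.8, quiet=0.2):
    """Threshold criterion: desired node >= fire, all other nodes <= quiet."""
    return all(
        (v >= fire) if i == desired else (v <= quiet)
        for i, v in enumerate(outputs)
    )


def lenient_hit(outputs, desired):
    """Winner-take-all criterion: the desired node has the largest output."""
    return max(range(len(outputs)), key=outputs.__getitem__) == desired


def accuracy(samples, hit):
    """Fraction of (outputs, desired) pairs that satisfy the criterion."""
    return sum(hit(o, d) for o, d in samples) / len(samples)


# Hypothetical two-node outputs (TGT, NT), each paired with its desired class index.
samples = [([0.9, 0.1], 0), ([0.6, 0.3], 0), ([0.4, 0.7], 1), ([0.7, 0.2], 1)]
strict_accuracy = accuracy(samples, strict_hit)
lenient_accuracy = accuracy(samples, lenient_hit)
```

Outputs such as (0.6, 0.3) favor the right class but fail the
thresholds, so the strict criterion counts them as errors while a
maximum-based rule does not.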

The results of each method follow, with the training data first,
followed by the testing data. The results from the Bayesian
classifier are provided next and comparisons between the two
classifiers conclude this subsection. Figures 5.1, 5.2, and 5.3
consider the training data, while Figs. 5.4, 5.5, and 5.6 consider
the test data.

Figure 5.3  Network training performance for second order method
(accuracy versus number of iterations). The network achieved over
67% accuracy on training data. The second order method reached this
accuracy faster than the other two methods.

Figure 5.4  Network testing performance for gradient method
(accuracy versus number of iterations). The test data accuracy was
poor, reaching only 62%.

Figure 5.5  Network testing performance for momentum method
(accuracy versus number of iterations). The network averaged roughly
60% accuracy over the last 30,000 iterations. Not much improvement
over the gradient method.

Figure 5.6  Network testing performance for second order method
(accuracy versus number of iterations). The network averaged over
65% accuracy over the last 30,000 iterations, revealing a distinct
improvement.


Again, the momentum and second order methods continue to exceed
the performance of the gradient of steepest descent method, but only
slightly. The differences between the momentum and second order
techniques are again minimal. However, close examination of the
data reveals that the second order method begins to exceed the
performance of the momentum method at 30,000 iterations for the
training data. The second order method also has a distinct
advantage over the momentum method during the last 30,000
iterations. These same observations are carried over into the test
data results once the network reaches 60,000 iterations. Although
the results as a whole were not terribly exciting, the fact that the
second order method provided better accuracy in fewer iterations is
significant. However, the network performance has not been averaged
and these results should not be taken out of context. More testing
is required, such that the network performance may be averaged. In
addition, it would be desirable to increase the number of features.

5.2.3.2.  Average  Total  Output  Error 

The same criterion for determining the average output error in
section 4.4.3.2 is used here. The average total output error was
measured only for the training data of all three methods. The
training error is given so that the decreasing trends may be
observed and verified.

Figure 5.7  Network training performance for gradient method
(log(error) versus log(number of iterations)). Notice that the
error plot is much smoother than in Figs. 5.8 and 5.9.

Figure 5.8  Network training performance for momentum method
(log(error) versus log(number of iterations)). Notice the smooth
initial reduction. Although rough in the latter training stages,
the error continues to decrease.

Figure 5.9  Network training performance for second order method
(log(error) versus log(number of iterations)). Although rough, the
error continues to decrease over the course of training.

Here again, the momentum and second order methods are slightly
more successful than the gradient of steepest descent method. As
noted earlier, there are still minimal differences between the
momentum and second order methods. The error was extremely
unstable, but decreasing nonetheless. Averaging the data would have
smoothed the results quite a bit. Again, further testing is
required.

Notice that the roughness in later training iterations appears
more significant with the second order method than with either of
the other two methods. In turn, the momentum method is rougher than
the gradient method. Again, part of this would be suppressed by
averaging. However, the results may further suggest another reason.
The choice of learning parameters may




The following tabulated results demonstrate the comparisons of
classification accuracy between the statistical Bayesian classifier
and the neural net classifier. The results are broken down in
Table 5.4. Each method of the neural net classifier is considered
after the network had achieved 200,000 training iterations. The
results provided below are just typical instances of each method;
the results have not been averaged. Again, keep in mind that the
criterion used for each classifier is different, with the neural net
classification criterion being much more stringent. Appendix E
demonstrates a more comparable criterion.


Table 5.4  Classification Accuracy of Neural Net Classifiers
versus the Bayesian Classifier. (1) Gradient Method, (2) Momentum
Method, (3) Second Order Method, (4) Bayesian.

                        Accuracy
Class           (1)      (2)      (3)      (4)
TGT Training   73.6%    81.4%    82.8%    80.3%
NT Training    83.1%    87.0%    66.6%    70.6%
TGT Testing    57.9%    61.4%    64.8%    76.9%
NT Testing     57.8%    62.2%    64.7%    74.7%

The neural net classifier exceeds the performance of the Bayesian
classifier on the training data, and the roles are reversed for the
test data. It is believed that increasing the number of features
may provide more information to the classifiers, and especially to
the neural net classifier. The neural net classifier learns by
example: the more examples, the more information is provided to the
net. In other words, the net learns more about its environment.
Thus, the net would have more of an opportunity to extract the
essence of a particular object.

5.3.  Moment  Invariant  Feature  Classification 

This  section  concerns  the  classification  of  the  moment 
invariant  features.  The  following  subsection  describes  the 
input  training  data.  Next,  the  network  architecture  is 
discussed,  followed  by  the  classification  results. 

5.3.1.  Input  Feature  Data 

The classes considered for this study consisted of tanks (TA),
trucks (TR), and armored personnel carriers (AP). Initially, there
were targets generated from two fields of view, narrow and wide.
The narrow field of view consisted of 3.43 degrees in the horizontal
and 3.57 degrees in the vertical. The wide field of view consisted
of 10.32 degrees in the horizontal and 7.74 degrees in the vertical.
For reasons to be explained later, the narrow field of view objects
were used exclusively. There were 104 feature vectors, of which 75%
were used for training, while the remainder were used for testing.
Table 5.5 breaks down the number of samples for each category for
the narrow field of view targets.


Table 5.5  Target Data Base

Class   # Training Samples   # Test Samples
TA              25                  7
TR              25                  5
AP              25                 17

There was a relatively even distribution of vectors describing
each target, allowing an even distribution of the training data,
as opposed to the breakdown listed in Table 4.4. In general, the
more examples of an object the network has to train on, the better
chance the network has of learning that object.

5.3.2.  Network  Architecture  and  Learning  Parameters 

The multilayer perceptron consists of 3 layers, which will
receive 36 input features and output 3 classes. Table 5.6
describes the network architecture, while Table 5.7 defines the
predetermined learning parameters.

Table 5.6  Network Architecture Data

Number of Features    36
Layer One Nodes       27
Layer Two Nodes        9
Number of Classes      3

Table 5.7  Network Training Data

Parameter              Gradient Method   Momentum Method   SO Method
a1                          0.1               0.1             0.1
a2                          0.0               0.0             0.0
a3                          0.0               0.0             0.0
a4                          1.0               0.1             0.1
a5                          0.0               0.0             0.1
Number of Iterations       20,000            20,000          20,000
Data Output Interval          400               400             400

Many combinations of learning parameters were tried, but the
search was not exhaustive. The learning parameters listed in the
above table provided the best results.

5.3.3.  Classification Results

The same comparisons as those made in section 4.4.3.1 will again
be drawn in this section for the FLIR imagery features. The average
classification accuracy, as well as log error plots, are considered.
Again, results of all three minimization methods will be compared.

5.3.3.1.  Average Classification Accuracy

Here again, the same criterion used for classification in section
5.2.3.1 is used for these accuracy results. Initially, both wide
and narrow fields of view were used for classification. However,
classification accuracies never exceeded 63% on the training data.
The images generated from the wide field of view produced poor
target resolution. Targets were not distinguishable from non-target
blobs, much less from one another, to the human observer. The
results obtained after removing from the target data base those
features segmented from the wide field of view images are displayed
below. Figures 5.10, 5.11 and 5.12 display the average training
accuracy.

Figure 5.10  The network achieves accuracies just under 95%.

Figure 5.11  Average network training performance for momentum
method. The network achieves accuracies of 98%.

Figure 5.12  The network achieves accuracies of 98%.

Below, Fig. 5.13 displays the first 4000 iterations for each
method on the same graph for a better comparison.

Figure 5.13  Comparison of average network training performance
(versus number of iterations). The momentum and second order
methods clearly exceed the performance of the gradient of steepest
descent. In addition, notice that the second order method on the
average converges somewhat faster than the momentum method.


Figures 5.14, 5.15 and 5.16 represent the test data accuracies of
each method.

Figure 5.14  The network achieves

Figure 5.15  The network achieves 85% accuracy. Notice how the
accuracy deteriorates over continuous training.

Figure 5.16  The network achieves 85% accuracy. Notice how the
accuracy deteriorates over continuous training.


Figure 5.17 displays all three plots over the first 4,000
iterations for the test data.

Figure 5.17  The momentum and second order methods clearly exceed
the performance of the gradient of steepest descent. In addition,
notice that the second order method performs a little better than
the momentum method.

Once again, the gradient of steepest descent has failed to perform
on a comparable level with the other two minimization techniques.
In the case of the Doppler imagery, there was not the significant
difference between the different techniques that is observed here
with the FLIR imagery feature vectors. A closer look at the
accuracy plots reveals that the SO method initially converges a
little faster than the first order momentum method, as shown in
Fig. 5.13 and reinforced in Fig. 5.17. In Figs. 5.15 and 5.16,
observe that after about 1,200 or so training iterations the test
data accuracy begins to deteriorate. This phenomenon may be
explained more clearly by analyzing the average total output error.


5.3.3.2.  Average Total Output Error

Again, the average total output error is defined exactly the way
it was in section 4.4.3.2. However, the total output error was
averaged over 75 feature vectors for the training set and averaged
over 29 vectors for the test set. The log error will be plotted for
the training data and test data.


Figure 5.18  Average network training performance for gradient
method (log(error) versus log(number of iterations)). Notice the
smooth descent over all training iterations.

Figure 5.19  Notice the reduction in error over the gradient
method.

Figure 5.20  Average network training performance for second order
method. Note the minimal difference between the momentum and second
order methods.

Notice that in each graph the log of the error is basically a
smooth decreasing curve to about 1,200 training iterations, after
which the curve becomes very unstable. The average output error for
the test data will provide a bit more insight into this peculiarity.
The average output error for the test data will also support the
information contained in Figs. 5.15 and 5.16.

Figure 5.21  Average network testing performance for gradient
method. Notice the increase in error at roughly 10,000 iterations.

Figure 5.22  Average network testing performance for momentum
method. Notice the increase in error at roughly iterations.

Figure 5.23  Average network testing performance for second order
method. Notice the increase in error at roughly

In each error plot, the error decreases to some minimum and then
abruptly begins to increase. This phenomenon agrees with
observations made in Figs. 5.15 and 5.16.

Considering Figs. 5.16 and 5.19, it appears the network is
learning and converging on the optimal weight values which
minimize the error surface. However, as the feature vectors are
presented to the network over and over, a point is reached when
the network attempts to find an exact solution for the optimum
weight values. If the region of the minimum is relatively flat,
there may be many solutions. The net then begins to memorize the
training data and loses its ability to generalize and abstract
the essence of the target. This may explain why the average
network performance begins to degrade on the test data. The
network expects data which closely resembles the training data.
When it does not see this resemblance, it makes an inaccurate
decision.
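
The behavior described above suggests halting training near the
iteration at which the error on held-out data is lowest. The
following is a minimal sketch of that idea, not a procedure used in
this study; the function name and the sample values are hypothetical
and serve only to illustrate locating the minimum of a recorded
error curve.

```python
def best_stopping_point(iterations, test_errors):
    """Return (iteration, error) at which the recorded test error is lowest."""
    if len(iterations) != len(test_errors) or not iterations:
        raise ValueError("need two non-empty sequences of equal length")
    # Index of the smallest recorded error (first one if there are ties).
    best = min(range(len(test_errors)), key=test_errors.__getitem__)
    return iterations[best], test_errors[best]


# Illustrative values only (not measured data from this study).
iters = [100, 1000, 5000, 10000, 20000, 40000]
errs = [0.90, 0.40, 0.22, 0.15, 0.21, 0.35]
stop_iter, stop_err = best_stopping_point(iters, errs)
```

With these made-up samples the error falls until iteration 10,000
and rises afterwards, so that iteration is returned.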

5.4.  Summary 

Classification of the target and non-target features using the
neural net classifier exceeded the performance of the Bayesian
classifier on the training data, but the Bayesian classifier
performed better on the testing data. However, by applying a
more lenient classification criterion to the neural net
classifier, it is believed that the network performance will
improve. Appendix F provides the results when applying a more
lenient criterion to the neural net.




Classification of the moment invariant features proved quite
successful. The momentum and second order methods clearly
exceeded the performance of the gradient method of steepest
descent. Classification accuracies near perfection were measured
for the training data, while accuracies of 85% were achieved for
the test data. Unfortunately, there was little difference
between the performance of the momentum method and the second
order method, other than the second order method providing a
slight improvement in convergence time.

Chapter  6  entertains  a  general  discussion  on  the  results  of 
all  three  minimization  techniques  used  in  chapters  4  and  5. 
Aside  from  these  discussions,  recommendations  and  conclusions  are 
provided  to  conclude  the  thesis  effort. 

6.  Discussions, Recommendations and Conclusions

The following sections provide the closing remarks concerning
the results of this thesis effort. The chapter begins with a
discussion of the general results and findings provided in
chapters 4 and 5. Areas of further study in pattern
classification using neural networks, specific to this research,
are considered in section 6.2. The conclusion discusses the
contributions of this effort to the field of tactical target
recognition using neural network classifiers.

6.1.  Discussions 

The findings obtained in chapters 4 and 5 provide some very
interesting results. For example, with three different sets of
input feature vectors, it was shown that there were distinct
disadvantages in applying the gradient method of steepest
descent as a minimization technique. In every case studied, the
momentum and second order approximation methods exceeded the
performance of the steepest descent technique. There are
several reasons for this, and a few are discussed in the
paragraphs to follow.

When applying the steepest descent method, recall that the
weights are updated in small increments that are a constant
proportion of the partial derivative of the performance surface.
If the minimum of the error surface lies in a relatively flat
region, the reduction in error will be very slow. In addition,
the existence of local minima could also pose a significant
threat to convergence. Another reason may lie in the fact that
the gradient (the vector of partial derivatives) may not point
in the direction of the global minimum. In weight hyperspace,
the partial derivative of the error surface (the slope) in one
weight direction may be far greater than the partial in another
weight direction. Hence, the gradient would not necessarily
point to the global minimum, and the weight update may be of
little consequence in convergence. Thus, these results confirm
and reinforce the ideas and concepts of many researchers
throughout the literature.

However, there is disagreement among many in the field of neural
networks as to the best way to accelerate convergence. A
seemingly controversial means of acceleration is the momentum
term. Again, in all applications in this study, the momentum
method clearly exceeded the performance of the steepest descent
method. One reason for this is the additional information the
momentum method provides to the network concerning the error
surface. With this method, the network is allowed to look back
in time by one time step, which allows it to add the so-called
momentum term. This momentum term is simply a weighted version
of the weight update from the previous time step, allowing
convergence to continue in the same direction as the previous
step. If the current update has the same sign as the previous
update, then convergence towards the minimum is accelerated. If
the signs are opposite, then the update is small; it is hoped
that this will prevent the network from overshooting the
minimum. However, this may not be the case, and the algorithm
may still be susceptible to local minima. Nevertheless, in
general, the momentum term has the effect of adding a quadratic
term to the minimum of the error surface. This term performs an
average of the current and previous updates, and the result is a
smoothing of the error surface. This study shows the momentum
method to be quite effective when compared to the steepest
descent method.
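
The difference between the two update rules can be illustrated
with a minimal sketch. It is not the network code used in this
study: the two-weight error surface, the learning rate and the
momentum coefficient below are hypothetical choices, picked so that
one weight direction is much steeper than the other.

```python
def descend(grad, w0, rate, momentum=0.0, steps=100):
    """Gradient steps; each update adds momentum * (previous update)."""
    w = list(w0)
    prev = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        # Steepest descent term plus a weighted copy of the last update.
        update = [-rate * gi + momentum * pi for gi, pi in zip(g, prev)]
        w = [wi + ui for wi, ui in zip(w, update)]
        prev = update
    return w


# Hypothetical error surface s(w) = 0.5 * (w0^2 + 25 * w1^2):
# shallow along w0, steep along w1.
def error(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def grad(w):
    return [w[0], 25.0 * w[1]]


plain = descend(grad, [1.0, 1.0], rate=0.03, momentum=0.0)
with_momentum = descend(grad, [1.0, 1.0], rate=0.03, momentum=0.5)
```

Setting `momentum=0.0` recovers plain steepest descent. On this
particular surface, the run with momentum ends closer to the
minimum after the same number of steps, because successive updates
along the shallow direction share the same sign and reinforce one
another.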

The second order approximation to Newton's method proved just as
successful as the momentum method and in some instances slightly
accelerated convergence. The basic concept behind accelerating
convergence is to provide the network with as much information
as possible about the ever changing error surface. In doing so,
the decisions made by the network become faster, more decisive
and more accurate. In using second derivative information, it is
desired to gain new information. For instance, with first order
techniques such as those described above, all the information
about past training is stored in the positional values of the
weights. Second order methods store information about the local
shape of the error surface, as well as maintaining the
positional information. In addition, the algorithm used in this
study applies time derivative information. With each new input
presented to the network, the performance surface changes; that
is, the performance surface changes with time. The time
derivatives of signals propagating through the network provide
the network with information regarding how the performance
surface changes with time. Any acceleration in convergence over
the other two methods (within this study) has to be attributed
to the added information this algorithm provides to the network.

The reason there was not much difference between the momentum
and second order methods is not well understood. Perhaps the
answer lies in the fact that the algorithm used in this study is
merely an approximation to the more powerful Newton's method.
After all, the actual implementation of this approximation added
one term to the momentum method already existing within the
algorithm (see Eq. 2.6). This additional term contained all of
the second derivative information, as well as the time
derivative information. The following section provides
recommendations for areas of further study, including techniques
which may provide more information than the momentum method and
thus accelerate convergence.

6.2.  Recommendations 

The second order algorithm approximating Newton's method showed
some promise for pattern classification. However, from the
results of chapters 4 and 5, the second order algorithm provided
very little improvement over the momentum method, other than a
slight increase in the speed of convergence. Therefore, further
studies are required to improve the performance of neural net
classifiers.


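1.  In chapter 2, it was shown that Parker [8] derived an
approximation to the powerful iterative technique known as
Newton's method. More specifically, the algorithm approximates
the following second order Newton's method:

$$ \frac{dw}{dt} = -\left[ \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] \right]^{-1} \mu \, \frac{\partial s}{\partial w} $$
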

Many problems arise when attempting to solve this linear
differential equation. First of all, the number of operations
required to invert the matrix is a cubic function of the number
of weights (n) in the network. Moreover, as shown in appendix D,
the above matrix has a singularity and therefore is not
invertible. For these reasons and more, an iterative approach
was used to approximate the above linear differential equation.
The approximation basically concerned the splitting of the
matrix as shown in appendix A. There are many ways to
approximate the above equation by iterative techniques.

Aside from the approximation to Newton's method applied in this
study, other approximations should be explored. One such
approximation may apply the Jacobi method for splitting a
matrix, in lieu of using the identity matrix. Another method may
employ the Gauss-Seidel method. Still another method, which may
prove to be even more powerful, is the successive overrelaxation
method, which is a generalized version of the Gauss-Seidel
method. These methods allow the inversion of a portion of the
matrix, such as the diagonal, rather than no inversion at all.
Inverting the diagonal matrix is a simple computational process.
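
As an illustration of a splitting in which only the diagonal is
inverted, the following is a minimal sketch of the Jacobi iteration
for a generic system Ax = b. It was not implemented in this study.
The 2-by-2 system is hypothetical and was chosen to be diagonally
dominant so that the iteration converges; the second partial matrix
discussed above need not have that property.

```python
def jacobi(A, b, iterations=50):
    """Jacobi iteration for A x = b: only the diagonal of A is inverted."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # Every component is updated from the previous iterate only.
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    return x


# Hypothetical diagonally dominant system (exact solution 13/9, 29/9).
A = [[4.0, 1.0], [2.0, 5.0]]
b = [9.0, 19.0]
x = jacobi(A, b)
```

The Gauss-Seidel variant differs only in that each newly computed
component is used immediately within the same sweep.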

A considerable amount of information is known about the second
partial derivative matrix, as alluded to in appendix D. For
instance, the concavity of the matrix is solely dependent upon
the weighted input matrix. Hence, the more we know about the
input, the more we know about the concavity of the performance
surface. Second order techniques use the concavity of the
performance surface in computing a new update to the parameter
being optimized. As an example of where this discussion is
heading, suppose that the input components were orthogonal to
one another. This provides an orthogonal matrix. With this added
information on the input, it would be a simple process to invert
the matrix, since

$$ A^{-1} = A^{T}. $$

2.  The algorithm also approximated the average second partial
derivative of the performance surface as its instantaneous
value. A better approach may be to calculate the average second
partial matrix or some portion of it. A means of computing this
average may take the following form:

$$ \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] = \mu \sum_{k=0}^{\nu} \frac{\partial^2 s}{\partial w \, \partial w^T} \, \exp\left( -\mu \, (\nu - k) \right), $$

where $\nu$ represents the maximum number of time steps desired
to average, while k represents the current time step. The above
equation provides a weighted memory of the past, with the most
recent input receiving the most weight.
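
A minimal sketch of such an exponentially weighted average is given
below. It assumes the second partial matrix evaluated at each time
step k = 0, ..., nu has already been stored in a list; the function
name, the list layout and the 1-by-1 example matrices are
hypothetical.

```python
import math


def weighted_average(matrices, mu):
    """Return mu * sum_k H_k * exp(-mu * (nu - k)), nu = len(matrices) - 1."""
    nu = len(matrices) - 1
    rows, cols = len(matrices[0]), len(matrices[0][0])
    avg = [[0.0] * cols for _ in range(rows)]
    for k, H in enumerate(matrices):
        # The most recent matrix (k = nu) receives the largest weight.
        weight = mu * math.exp(-mu * (nu - k))
        for i in range(rows):
            for j in range(cols):
                avg[i][j] += weight * H[i][j]
    return avg


# Hypothetical 1x1 second partial "matrices" at time steps k = 0, 1, 2.
history = [[[1.0]], [[2.0]], [[4.0]]]
avg = weighted_average(history, mu=0.5)
```

Larger values of mu shorten the memory, since older matrices are
discounted more heavily.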

Another  alternative  would  be  to  explore  a  batch 
technique.  The  idea  is  to  process  all  of  the  input  data  and 
then  average  the  total  output  error  and  corresponding 
partials.  The  respective  update  equation  would  then  have 
information  pertaining  to  all  inputs  and  again  provide 
somewhat  of  a  memory  of  all  input  data  from  the  environment. 

3.  A more thorough investigation of the Roggemann FLIR features
is required. The number of features used for target and
non-target classification could be increased from the three used
in this study. Later findings by Roggemann show that the
Bayesian classifier improved dramatically when the number of
features was increased (to 9). The neural network performance
should improve given the additional information on the input.


6.3.  Conclusions 

The second order approximation to Newton's method proved quite
successful in pattern classification applications. In some
instances, it slightly accelerated convergence. The network was
able to classify targets with a moderately high degree of
accuracy. The classification of features segmented from the FLIR
imagery is truly exciting. Each method provided classification
accuracies on the test data of close to 65% and near perfect
accuracy on the training data, when using the moment invariants
as features.

There were four basic contributions made during this thesis
effort. First of all, this research effort has provided, tested
and validated a new biologically based neural network
classifier. The network applies information concerning the
second partial derivatives of the performance surface (in this
case, the error surface). In addition to providing second
derivative information, this algorithm also provides information
about how the surface is changing with time. Even though the
algorithm did not provide a significant improvement in
convergence time or accuracy, it still performed on a level
comparable with the momentum method. This result alone adds
validity to the concepts behind the algorithm, and continued
study in this area is warranted. Furthermore, this algorithm
allowed for an easy comparison between three different
minimization techniques. The results of this thesis clearly
demonstrate the advantages of using the momentum and second
order methods over the steepest descent method.

Secondly, the success of the artificial neural network
classifier reinforces the fact that such classifiers can be very
effective in automated target recognition applications. In
comparison with the Bayesian classifier, results demonstrate
that the neural network classifier exceeded the performance of
the statistical classifier on the training data. However, the
Bayesian classifier performed better on the test data. When the
classification criterion was less stringent (and comparable to
the Bayesian criterion), the neural net classifier using the
second order method further exceeded training performance
levels, and the test data accuracy approached the performance
levels of the Bayesian classifier. This can be observed from the
results in Appendix E. These results reinforce earlier results
found by Ruck [12], demonstrating the superior performance of
neural net classifiers over statistical classifiers.

The third contribution concerns network generalization. Results
show that there may be a definite dividing line between the
network actually learning and memorizing its environment over
continuous training. Once that dividing line is crossed, the
network begins to memorize, thus destroying its ability to
generalize or learn from its environment. In this study, the
test data accuracy began to deteriorate. This was particularly
noticeable with the FLIR imagery features. When the classifier
processed the doppler imagery features, this phenomenon was not
as noticeable. This should not be terribly disturbing, since the
data base consisting of doppler imagery features was so heavily
influenced by tank features. The network may not have seen the
other features often enough to draw such a conclusion.

The fourth and final contribution reveals that the classifier
has the same disadvantages as a human observer. For instance,
when the field of view was wide, the resolution of the object of
interest was poor, and the actual target was indistinguishable
by a human observer. Using a narrower field of view provided
better resolution, and the target blobs were very distinct. When
using features generated from both wide and narrow fields of
view, the classification accuracy never grew higher than 63% on
the training data. After removing the objects from the wide
field of view, classification accuracy increased dramatically.

In a supervised training environment, feeding the network a
poorly resolved object with a classification label is the same
as lying to the network.


Appendix  A:  An  Introduction  to  Newton's  Method  and  Iterative 
Methods 

In chapter two, the approach to Parker's second order derivation
[8] was introduced. In this appendix, the steps which were left
out are provided in detail. The first section covers the missing
steps in the derivation of the second order Newton's method.
This is immediately followed by an introductory discussion of
Newton's method in general. Parker chose to approximate Newton's
method by solving the linear differential equation with an
iterative approach. Therefore, an iterative approach is
discussed in general in the third section. The final topic for
discussion concerns convergence and the addition of Parker's
leakage terms to the approximation derived from the iterative
approach. The following text is the result of conversations and
notes taken from interviews with Dr. Mark Oxley [6].


A.1.  Derivation of Newton's Method

The derivation begins with a restatement of Eq. 2.4, where the
functional dependencies of s on t and w(t) have been suppressed
for convenience. The average instantaneous performance is given
by:

$$ \mathrm{avg}(s) = \mu \int_{-\infty}^{t} s \, e^{-\mu (t - \tau)} \, d\tau $$

Assume that t is fixed, so that the above equation provides a
snapshot of the performance surface. The equation for the
optimum weights is derived from an optimum criterion. This
implies that the derivative of the average performance surface
with respect to the weights, evaluated at the optimum weights
(w*), is zero. Temporarily, let

$$ q = \frac{\partial \, \mathrm{avg}(s)}{\partial w^*} = 0 = \mu \int_{-\infty}^{t} \frac{\partial s}{\partial w^*} \, e^{-\mu (t - \tau)} \, d\tau. \qquad \text{(A.1)} $$

Now, let t vary, and q becomes a constant function of time.
Again, from an optimum perspective, as s changes the weights
continue to follow the moving minimum. The next step is to apply
Leibniz's rule and compute the time derivative of Eq. A.1, where

$$ \frac{\partial q}{\partial t} = 0 = \mu \frac{\partial s}{\partial w^*} + \mu \int_{-\infty}^{t} \frac{\partial}{\partial t} \left( \frac{\partial s}{\partial w^*} \right) e^{-\mu (t - \tau)} \, d\tau - \mu^2 \int_{-\infty}^{t} \frac{\partial s}{\partial w^*} \, e^{-\mu (t - \tau)} \, d\tau. $$

Notice that the second integral term is equal to $-\mu q$, and
therefore equal to zero. In the first integral, the only way s
depends on t is through w(t) and not on the input, which depends
on $\tau$. Therefore, the following relationship holds:
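
$$ \frac{\partial}{\partial t} \left( \frac{\partial s}{\partial w^*} \right) = \frac{\partial^2 s}{\partial w^* \, \partial w^{*T}} \, \frac{dw^*}{dt}, $$

such that $\partial q / \partial t$ becomes

$$ \frac{\partial q}{\partial t} = 0 = \mu \frac{\partial s}{\partial w^*} + \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^* \, \partial w^{*T}} \right] \frac{dw^*}{dt}. $$

The exponentially weighted time average of the second derivative
is a matrix (see appendix B). If the above matrix is invertible
(see appendix D), an explicit first order differential equation
for the optimum path of the weights is: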


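$$ \frac{dw^*}{dt} = -\left[ \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^* \, \partial w^{*T}} \right] \right]^{-1} \mu \, \frac{\partial s}{\partial w^*} \qquad \text{(A.2)} $$

Thus Eq. 2.5 has been derived. Equation A.2 is a second order
Newton's method.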


A.2.  Newton's  Method  in  General 

Newton's method is an iterative method for solving equations of
the form f(x) = 0, where f(x) has a continuous derivative. The
method is commonly used because of its simplicity and rapid
convergence. The general idea is to approximate the graph of
f(x) by suitable tangents; see Fig. A.1.

Figure A.1.  An illustrative graph of Newton's method [3]


Using an approximate value $x_0$ obtained from the graph of
f(x), let $x_1$ be the point of intersection of the x-axis and
the tangent to the curve at $f(x_0)$. Then

$$ \tan\beta = \frac{df(x_0)}{dx} = \frac{f(x_0)}{x_0 - x_1}, $$

where

$$ x_1 = x_0 - \frac{f(x_0)}{df(x_0)/dx}. $$



In the second step, $x_2$ is found from $x_1$, and in the
following step $x_3$ is found from $x_2$. This is performed
until $f(x_k)$ approaches 0. In general, Newton's method
becomes:

$$ x_{k+1} = x_k - \frac{f(x_k)}{df(x_k)/dx}. $$

Equation A.2 is a straightforward continuum extension of the
above Newton's method. It is desired to have the partial of s
with respect to w* approach 0.
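
The scalar iteration can be sketched in a few lines. This is only
an illustration of Newton's method in general, not of the network
algorithm; the function name, the tolerance and the example
f(x) = x^2 - 2 are hypothetical choices.

```python
def newton(f, dfdx, x0, tol=1e-12, max_steps=50):
    """Iterate x_{k+1} = x_k - f(x_k) / f'(x_k) until |f(x_k)| <= tol."""
    x = x0
    for _ in range(max_steps):
        fx = f(x)
        if abs(fx) <= tol:
            break
        # Follow the tangent at x down to its intersection with the x-axis.
        x = x - fx / dfdx(x)
    return x


# Hypothetical example: the positive root of f(x) = x^2 - 2.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Starting from x0 = 1, the iterates approach the square root of 2
within a handful of steps.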

A.3.  An Iterative Method for Ax = b

If the matrix of Eq. A.2 is not invertible (and it is not), or
calculating the second order Newton's method is tedious, there
is another approach.

Consider the following expression given by

$$ A(t, x(t)) \cdot x(t) = b, $$

which is in the same form as Eq. A.2 when the matrix on the
right hand side of that equation is moved to the left. Let
Ax = b, to reduce the notational overhead. One approach to
approximating the vector x is to perform an iterative method.
First, multiply the matrix A and the vector b by $\beta \Delta t$.
The $\beta$ controls the rate of convergence and $\Delta t$ is a
small time increment which will later go to zero. Then add and
subtract the identity matrix from the resulting A matrix. This
is a means of splitting the A matrix and is accomplished in the
following manner:

$$ (\beta \, \Delta t \, A + I - I) \, x = \beta \, \Delta t \, b, $$

and

$$ x = x - \beta \, \Delta t \, A \, x + \beta \, \Delta t \, b. $$


The next step allows the iterative approach to take full form.
The idea is to use the partial time derivative of x(t) to
compute an improved estimate of the partial time derivative of
$x(t + \Delta t)$. It is desired to use some known vector to
predict an improved version of that same vector. It is hoped
that, by performing this process in an iterative manner, the
time derivative of $x(t + \Delta t)$ will eventually be
obtained. Returning the functional dependencies for clarity, the
iterative approach assumes the following form:

$$ x(t + \Delta t) = x(t) - \beta \, \Delta t \, A(t, x(t)) \, x(t) + \beta \, \Delta t \, b, $$

where

$$ \frac{x(t + \Delta t) - x(t)}{\Delta t} = -\beta \, A(t, x(t)) \, x(t) + \beta \, b. $$

In the limit as $\Delta t$ approaches zero, the left hand side
defines the partial time derivative of x(t), that is

$$ \frac{\partial x(t)}{\partial t} = -\beta \, A(t, x(t)) \, x(t) + \beta \, b. $$
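
The discrete form of this iteration, x <- x - beta*dt*A*x +
beta*dt*b, can be sketched as follows for a constant matrix A. The
system, beta and dt below are hypothetical values chosen so that
the iteration is stable; they are not taken from the network
experiments, where A changes with time and is not invertible.

```python
def iterate(A, b, beta, dt, steps):
    """Repeat x <- x - beta*dt*(A x) + beta*dt*b, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # No part of A is inverted; only matrix-vector products are needed.
        x = [x[i] - beta * dt * Ax[i] + beta * dt * b[i] for i in range(n)]
    return x


# Hypothetical symmetric positive definite system (exact solution 0.8, 1.4).
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 5.0]
x = iterate(A, b, beta=1.0, dt=0.1, steps=500)
```

When the iteration stops changing, x(t + dt) - x(t) is zero and the
iterate satisfies Ax = b. If beta*dt is made too large relative to
the size of A, the iteration diverges instead.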

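Let

$$ x(t) = \frac{\partial w^*}{\partial t}, \qquad A(t, x(t)) = \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^* \, \partial w^{*T}} \right], \qquad b = -\mu \, \frac{\partial s}{\partial w^*}, $$

and then

$$ \frac{\partial^2 w^+(t)}{\partial t^2} = -\beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^+ \, \partial w^{+T}} \right] \frac{\partial w^+(t)}{\partial t} - \beta \, \mu \, \frac{\partial s}{\partial w^+}. \qquad \text{(A.3)} $$

The (+) notation is used to imply an approximate value. Hence,
Eq. 2.7 is derived.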


A.4.  Addition  of  Leakage  Terms  and  Convergence 


From Eq. A.3, Parker adds leakage terms to insure the network
will converge to some minimum, since the iterative approach is
no more than an approximation [7:593-600; 8]. Consider the
following argument. Suppose that the iterative approach succeeds
in driving $x(t + \Delta t) - x(t)$ to 0. Thus,

$$ A \, x^+ = b, $$

implying that the approximate $x^+$ lies in a family of
solutions consisting of linear combinations, if A is not
invertible. In other words, there are a number of solutions for
$x^+$ which satisfy the above equation. Many of these solutions
may lie in local minima. Thus, convergence to the optimum x* may
never be achieved. Therefore, a method must be sought to insure
convergence.

Parker argues that a natural way to insure convergence is the
addition of leakage terms [8]. It is assumed that the
integrators required to implement the algorithm are indeed
leaky. An analogy drawn by Parker concerns an analog circuit.
Consider electrons leaking off of a capacitor which stores
energy in the form of charge. Since all practical integrators
have a leakage rate, it seems logical to take them into
consideration [8].


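Consider Eq. A.3. The first step is to calculate the second
partial time derivative of $w^+$, and then integrate to obtain
the first partial time derivative of $w^+$, denoted as $q^+$.
By integrating $q^+$, $w^+$ is obtained:

$$ q^+ = \int_{-\infty}^{t} \left[ -\beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^+ \, \partial w^{+T}} \right] q^+ - \beta \, \mu \, \frac{\partial s}{\partial w^+} \right] d\tau, $$

$$ w^+ = \int_{-\infty}^{t} q^+ \, d\tau. $$
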

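Next, take the time derivative of each of the equations above:

$$ \frac{\partial q^+}{\partial t} = -\beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w^+ \, \partial w^{+T}} \right] q^+ - \beta \, \mu \, \frac{\partial s}{\partial w^+}, \qquad \frac{\partial w^+}{\partial t} = q^+. $$

Since two integrations are performed, two leakage terms are
required. Therefore, subtracting the respective leakage terms
from each of the above equations has the following result, with
$l_1$ and $l_2$ denoting the two leakage rates:

$$ \frac{\partial q}{\partial t} = -\beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] q - \beta \, \mu \, \frac{\partial s}{\partial w} - l_2 \, q \qquad \text{(A.4)} $$

$$ \frac{\partial w}{\partial t} = q - l_1 \, w. \qquad \text{(A.5)} $$
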


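The (+) notation has been removed, since it is now believed that
a better approximation of the quantities on the left hand side
of each equation has been found. It is now desired to remove the
dependency on q, by taking the time derivative of Eq. A.5:

$$ \frac{\partial^2 w}{\partial t^2} = \frac{\partial q}{\partial t} - l_1 \frac{\partial w}{\partial t} \qquad \text{(A.6)} $$

and substituting Eq. A.4 into Eq. A.6:

$$ \frac{\partial^2 w}{\partial t^2} = -\beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] q - \beta \, \mu \, \frac{\partial s}{\partial w} - l_2 \, q - l_1 \frac{\partial w}{\partial t} \qquad \text{(A.7)} $$

and finally, rewriting Eq. A.5 such that

$$ q = \frac{\partial w}{\partial t} + l_1 \, w $$

and substituting q into Eq. A.7, which yields the final result:

$$ \frac{\partial^2 w}{\partial t^2} = -\beta \, \mu \, \frac{\partial s}{\partial w} - \left[ l_1 l_2 \, I + \beta \, l_1 \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] \right] w - \left[ (l_1 + l_2) \, I + \beta \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] \right] \frac{\partial w}{\partial t}. \qquad \text{(A.8)} $$

To simplify the above equation somewhat, let

$$ a_1 = \beta \mu, \quad a_2 = l_1 l_2, \quad a_3 = \beta \, l_1, \quad a_4 = l_1 + l_2, \quad \text{and} \quad a_5 = \beta. $$

Reparameterizing Eq. A.8 results in the following:

$$ \frac{\partial^2 w}{\partial t^2} = -a_1 \frac{\partial s}{\partial w} - \left[ a_2 \, I + a_3 \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] \right] w - \left[ a_4 \, I + a_5 \, \mathrm{avg}\left[ \frac{\partial^2 s}{\partial w \, \partial w^T} \right] \right] \frac{\partial w}{\partial t}. $$
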

The result of adding the leakage terms is that it provides the
same effect as adding a momentum term (see appendix D) and
additive noise. The momentum term has the effect of smoothing
the error surface. The basic concept behind momentum is to
suppress local minima and enhance the global minimum. The
partial time derivative of w associated with $a_4$ is used in
introducing the momentum term (see section 3.4.2). The terms
associated with $a_2$ and $a_3$ combine to introduce noise into
the network.


Appendix B:  Linear Algebraic Forms and Notation

In  chapters  two  and  three,  the  equations  introduced  were 
heavily  dependent  on  linear  algebraic  forms.  This  appendix 
serves  as  an  attempt  to  clear  up  the  notational  overhead.  This 
section  also  provides  examples  of  elementary  linear  algebra  in 
the  form  of  matrix  addition  and  multiplication.  In  the 
discussions  below,  each  vector  is  considered  a  column  vector 
unless  otherwise  specified  as  the  vector  transpose. 

The  problem  posed,  is  to  expand  the  first  and  second  order 
partial  derivatives  of  the  performance  quantity  (s)  and  expose 
the  impending  linear  algebra.  In  chapter  three,  the  network 
performance  quantity  (s)  was  introduced  as  the  squared  error. 
Consider  the  first  partial  derivative  of  the  network  performance 
indicator  (s).  The  partial  derivative  of  s,  as  introduced  in 
chapter  three,  has  the  following  form  when  written  in  terms  of 
matrices, 

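$$ \frac{\partial s}{\partial w} = 2 \, \frac{\partial e^T}{\partial w} \, e \qquad \text{(B.1)} $$

where $e = [\, e_1 \;\; e_2 \;\; \cdots \;\; e_m \,]^T$ and

$$ \frac{\partial s}{\partial w} = \begin{bmatrix} \partial s / \partial w_1 \\ \partial s / \partial w_2 \\ \vdots \\ \partial s / \partial w_n \end{bmatrix} = 2 \begin{bmatrix} \sum_{k=1}^{m} \frac{\partial e_k}{\partial w_1} \, e_k \\ \sum_{k=1}^{m} \frac{\partial e_k}{\partial w_2} \, e_k \\ \vdots \\ \sum_{k=1}^{m} \frac{\partial e_k}{\partial w_n} \, e_k \end{bmatrix}. $$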

The partial derivative of s with respect to w is a column
vector. When the performance indicator is defined as error, each
component of the first partial is expressed as the sum of the
partials of each error signal with respect to a given weight.
The vector of partials with respect to each weight in the
network is the gradient, and it points toward the maximum of the
performance surface.
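
The matrix form of the first partial can be checked numerically.
The sketch below assumes, purely for illustration, a linear network
with error vector e = d - Xw and squared error s equal to the sum
of the squared error signals; the data X, d and w are hypothetical.
It compares the analytic expression, twice the sum over k of
(de_k/dw_i) e_k, with a central difference estimate.

```python
def errors(w, X, d):
    """Error vector e = d - X w for a hypothetical linear network."""
    return [dk - sum(xk[i] * w[i] for i in range(len(w))) for xk, dk in zip(X, d)]


def squared_error(w, X, d):
    """Performance quantity s = sum of squared error signals."""
    return sum(ek * ek for ek in errors(w, X, d))


def gradient(w, X, d):
    """ds/dw_i = 2 * sum_k (de_k/dw_i) * e_k, with de_k/dw_i = -X[k][i]."""
    e = errors(w, X, d)
    return [2.0 * sum(-X[k][i] * e[k] for k in range(len(e))) for i in range(len(w))]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d = [1.0, 0.0, 1.0]
w = [0.1, -0.2]
analytic = gradient(w, X, d)

# Central difference estimate of each partial derivative.
h = 1e-6
numeric = []
for i in range(len(w)):
    up = list(w)
    up[i] += h
    down = list(w)
    down[i] -= h
    numeric.append((squared_error(up, X, d) - squared_error(down, X, d)) / (2 * h))
```

The two vectors agree to within rounding error, which is a useful
sanity check before the second partial is expanded.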

Therefore,  the  discussion  begins  with  the  second  partial  of  s 
as  defined  in  chapter  three. 


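$$ \frac{\partial^2 s}{\partial w \, \partial w^T} = 2 \left\{ \frac{\partial e^T}{\partial w} \, \frac{\partial e}{\partial w^T} + \frac{\partial^2 e^T}{\partial w \, \partial w^T} \, e \right\} \qquad \text{(B.2)} $$
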


The  above  second  partial  is  a  matrix  as  is  each  of  its 
components.  Below  each  component  matrix  is  defined  and  expanded. 
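First, consider the partial of e transpose with respect to the
weight vector, where

$$
\frac{\partial e^{T}}{\partial w} =
\begin{bmatrix}
\dfrac{\partial e_1}{\partial w_1} & \dfrac{\partial e_2}{\partial w_1} & \cdots & \dfrac{\partial e_m}{\partial w_1} \\[6pt]
\dfrac{\partial e_1}{\partial w_2} & \dfrac{\partial e_2}{\partial w_2} & \cdots & \dfrac{\partial e_m}{\partial w_2} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_1}{\partial w_n} & \dfrac{\partial e_2}{\partial w_n} & \cdots & \dfrac{\partial e_m}{\partial w_n}
\end{bmatrix} \qquad \text{(B.3)}
$$

and similarly, the partial of e with respect to w transpose

$$
\frac{\partial e}{\partial w^{T}} =
\begin{bmatrix}
\dfrac{\partial e_1}{\partial w_1} & \dfrac{\partial e_1}{\partial w_2} & \cdots & \dfrac{\partial e_1}{\partial w_n} \\[6pt]
\dfrac{\partial e_2}{\partial w_1} & \dfrac{\partial e_2}{\partial w_2} & \cdots & \dfrac{\partial e_2}{\partial w_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial e_m}{\partial w_1} & \dfrac{\partial e_m}{\partial w_2} & \cdots & \dfrac{\partial e_m}{\partial w_n}
\end{bmatrix} \qquad \text{(B.4)}
$$
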
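The product of the matrices, Eqs. B.3 and B.4, determines
the first component of the sum in Eq. B.2.

$$
\frac{\partial e^{T}}{\partial w}\,\frac{\partial e}{\partial w^{T}} =
\begin{bmatrix}
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_1}\dfrac{\partial e_i}{\partial w_1} &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_1}\dfrac{\partial e_i}{\partial w_2} & \cdots &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_1}\dfrac{\partial e_i}{\partial w_n} \\[6pt]
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_2}\dfrac{\partial e_i}{\partial w_1} &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_2}\dfrac{\partial e_i}{\partial w_2} & \cdots &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_2}\dfrac{\partial e_i}{\partial w_n} \\
\vdots & \vdots & & \vdots \\
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_n}\dfrac{\partial e_i}{\partial w_1} &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_n}\dfrac{\partial e_i}{\partial w_2} & \cdots &
\sum_{i=1}^{m} \dfrac{\partial e_i}{\partial w_n}\dfrac{\partial e_i}{\partial w_n}
\end{bmatrix}
$$
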
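The next component considered is the second partial of e
transpose with respect to w and w transpose. This matrix is a
little more complicated, and care must be taken to ensure that
indices are maintained when multiplying by e. This last component
of the second partial of (s) is defined as

$$
\frac{\partial^{2} e^{T}}{\partial w\,\partial w^{T}}\, e =
\begin{bmatrix}
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_1\,\partial w_1} &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_1\,\partial w_2} & \cdots &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_1\,\partial w_n} \\[6pt]
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_2\,\partial w_1} &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_2\,\partial w_2} & \cdots &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_2\,\partial w_n} \\
\vdots & \vdots & & \vdots \\
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_n\,\partial w_1} &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_n\,\partial w_2} & \cdots &
\sum_{i=1}^{m} e_i \dfrac{\partial^{2} e_i}{\partial w_n\,\partial w_n}
\end{bmatrix}
$$
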
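
The two components of Eq. B.2 can be assembled entry by entry. The sketch below is illustrative only: the error vector, its partials, and all function names are hypothetical choices made so the result can be verified by hand, and are not taken from the thesis. Entry (k, l) of the second partial of s is twice the sum over i of (de_i/dw_k)(de_i/dw_l) + e_i (d2e_i/dw_k dw_l).

```python
def error_vector(w):
    # Hypothetical error signals: e_1 = w1*w2 - 1, e_2 = w1 + 2*w2.
    return [w[0] * w[1] - 1.0, w[0] + 2.0 * w[1]]


def first_partials(w):
    # The m x n matrix of Eq. B.4: entry [i][k] = de_i/dw_k.
    return [[w[1], w[0]],
            [1.0, 2.0]]


def second_partials(w):
    # Entry [i][k][l] = d2e_i/dw_k dw_l; only e_1 has curvature here.
    return [[[0.0, 1.0], [1.0, 0.0]],
            [[0.0, 0.0], [0.0, 0.0]]]


def hessian_of_s(w):
    # Eq. B.2: 2 * { (de^T/dw)(de/dw^T) + (d2e^T/dw dw^T) e }.
    e = error_vector(w)
    d1 = first_partials(w)
    d2 = second_partials(w)
    n, m = len(w), len(e)
    return [[2.0 * sum(d1[i][k] * d1[i][l] + e[i] * d2[i][k][l]
                       for i in range(m))
             for l in range(n)]
            for k in range(n)]
```

For this example s = (w1 w2 - 1)^2 + (w1 + 2 w2)^2, so the entries can be confirmed by differentiating s directly.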
Appendix C: Partial Derivatives of the Sigmoid Function

This appendix contains all the significant partial
derivatives of the sigmoid function used in this study of
artificial neural networks. Recall that the output of a single
cell was chosen to be the sigmoid function, as introduced in
chapter two. In the various implementation stages it was
necessary to compute the first and second partial derivatives
with respect to various independent variables. The independent
variables considered were the inputs to the cell (f_in), the
interconnection weights (w), and the time (t) variables.

For clarity and simplicity, a single cell is considered with
several inputs and, of course, a single output, as shown in Fig.
C.1.


Figure C.1 A Single Cell

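The first order partials will be considered first, followed
by the second order partials. The computations begin with the
partial of the output (f_out) with respect to w.

$$
\begin{aligned}
\frac{\partial f_{out}}{\partial w}
&= \frac{\partial}{\partial w}\left[\frac{1}{1+\exp(-f_{in}^{T} w)}\right] \\
&= \frac{\partial}{\partial w}\left[1+\exp(-f_{in}^{T} w)\right]^{-1} \\
&= -1\cdot\left(1+\exp(-f_{in}^{T} w)\right)^{-2}\,\frac{\partial}{\partial w}\exp(-f_{in}^{T} w) \\
&= \frac{1}{1+\exp(-f_{in}^{T} w)}\cdot\frac{\exp(-f_{in}^{T} w)}{1+\exp(-f_{in}^{T} w)}\cdot\frac{\partial}{\partial w}\left(f_{in}^{T} w\right)
\end{aligned}
$$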

Using the following relationship, where

$$
\frac{\exp(-a)}{1+\exp(-a)} = 1 - \frac{1}{1+\exp(-a)} = 1 - f_{out}
$$

and

$$
a = f_{in}^{T} w.
$$


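Taking the above substitution, the partial of f_out with respect
to w becomes:

$$
\frac{\partial f_{out}}{\partial w} = f_{out}\left(1 - f_{out}\right) f_{in} \qquad \text{(C.1)}
$$

For example, the partial of f_out with respect to w_i is

$$
\frac{\partial f_{out}}{\partial w_i} = f_{out}\left(1 - f_{out}\right) f_{in,i}
$$

Similarly, the partial of f_out with respect to the inputs has the
same form and is expressed as:

$$
\frac{\partial f_{out}}{\partial f_{in}} = f_{out}\left(1 - f_{out}\right) w \qquad \text{(C.2)}
$$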
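
The first order partials of the sigmoid with respect to the weights and the inputs can be coded directly from the output of the cell. This is a minimal sketch with hypothetical function names; it assumes the cell output is the sigmoid of the weighted sum f_in^T w, as stated above.

```python
import math


def cell_output(f_in, w):
    # Sigmoid of the weighted sum f_in^T w.
    a = sum(x * wi for x, wi in zip(f_in, w))
    return 1.0 / (1.0 + math.exp(-a))


def d_out_d_w(f_in, w):
    # Partial with respect to the weights: f_out * (1 - f_out) * f_in
    f = cell_output(f_in, w)
    return [f * (1.0 - f) * x for x in f_in]


def d_out_d_in(f_in, w):
    # Partial with respect to the inputs: f_out * (1 - f_out) * w
    f = cell_output(f_in, w)
    return [f * (1.0 - f) * wi for wi in w]
```

Both partials reuse the already computed output, which is the convenience noted at the end of this appendix.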


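As for the time derivative of the sigmoid function, it is
assumed that both the inputs and weights are functions of time.
Therefore, the derivative of the sigmoid with respect to time
will again be a partial over all the weights and inputs of the
cell, such that

$$
\begin{aligned}
\frac{\partial f_{out}}{\partial t}
&= \frac{\partial}{\partial t}\left[\frac{1}{1+\exp(-f_{in}^{T} w)}\right] \\
&= \frac{\partial}{\partial t}\left(1+\exp(-f_{in}^{T} w)\right)^{-1} \\
&= -1\cdot\left(1+\exp(-f_{in}^{T} w)\right)^{-2}\,\frac{\partial}{\partial t}\exp(-f_{in}^{T} w) \\
&= \frac{1}{1+\exp(-f_{in}^{T} w)}\cdot\frac{\exp(-f_{in}^{T} w)}{1+\exp(-f_{in}^{T} w)}\cdot\frac{\partial}{\partial t}\left(f_{in}^{T} w\right) \\
&= f_{out}\left(1 - f_{out}\right)\frac{\partial}{\partial t}\left(f_{in}^{T} w\right)
\end{aligned}
$$
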
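From here the chain rule is applied to obtain the following:

$$
\frac{\partial f_{out}}{\partial t}
= f_{out}\left(1 - f_{out}\right)\left[ f_{in}^{T}\,\frac{\partial w}{\partial t} + w^{T}\,\frac{\partial f_{in}}{\partial t} \right]
$$

The above may be rewritten by applying Eqs. C.1 and C.2, where

$$
\frac{\partial f_{out}}{\partial t}
= \frac{\partial f_{out}}{\partial w^{T}}\,\frac{\partial w}{\partial t}
+ \frac{\partial f_{out}}{\partial f_{in}^{T}}\,\frac{\partial f_{in}}{\partial t} \qquad \text{(C.3)}
$$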
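
As a check on the time derivative, the expression f_out (1 - f_out) (f_in^T dw/dt + w^T df_in/dt) can be compared against a finite difference along a known trajectory. The sketch below is illustrative: the rate vectors `f_in_rate` and `w_rate` are hypothetical stand-ins for df_in/dt and dw/dt and are not quantities defined in the thesis.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def d_out_d_t(f_in, w, f_in_rate, w_rate):
    # f_out * (1 - f_out) * (f_in^T dw/dt + w^T df_in/dt)
    f = sigmoid(dot(f_in, w))
    return f * (1.0 - f) * (dot(f_in, w_rate) + dot(w, f_in_rate))
```

If the inputs and weights move linearly in time, f_in(t) = f_in + t f_in_rate and w(t) = w + t w_rate, then the slope of the cell output at t = 0 should agree with `d_out_d_t`.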


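The second order partials are a straightforward extension
of the first order partials. In fact, Eqs. C.1 and C.2 will be
used in the computation of the second order partials. To begin,
the partial of f_out with respect to w and w transpose is
considered.

$$
\begin{aligned}
\frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}
&= \frac{\partial}{\partial w}\left[\frac{\partial f_{out}}{\partial w^{T}}\right]
 = \frac{\partial}{\partial w}\left[ f_{out}\left(1 - f_{out}\right) f_{in}^{T} \right] \\
&= \frac{\partial f_{out}}{\partial w}\left(1 - f_{out}\right) f_{in}^{T}
 - f_{out}\,\frac{\partial f_{out}}{\partial w}\, f_{in}^{T} \\
&= f_{out}\left(1 - f_{out}\right)^{2} f_{in} f_{in}^{T}
 - \left(f_{out}\right)^{2}\left(1 - f_{out}\right) f_{in} f_{in}^{T} \\
&= f_{out}\left(1 - f_{out}\right)\left[\left(1 - f_{out}\right) - f_{out}\right] f_{in} f_{in}^{T}
\end{aligned}
$$

$$
\frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}
= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) f_{in} f_{in}^{T}
$$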


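Similarly, the second partial of f_out with respect to f_in and
f_in transpose is found to be:

$$
\frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial f_{in}^{T}}
= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) w\, w^{T}
$$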


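Another second partial used in the formulation of the second
order algorithm is the second partial of f_out with respect to w
and f_in transpose. Again Eqs. C.1 and C.2 and the chain rule
are applied, to obtain

$$
\begin{aligned}
\frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}
&= \frac{\partial}{\partial w}\left[\frac{\partial f_{out}}{\partial f_{in}^{T}}\right]
 = \frac{\partial}{\partial w}\left[ f_{out}\left(1 - f_{out}\right) w^{T} \right] \\
&= \frac{\partial f_{out}}{\partial w}\left(1 - f_{out}\right) w^{T}
 - f_{out}\,\frac{\partial f_{out}}{\partial w}\, w^{T}
 + f_{out}\left(1 - f_{out}\right)\frac{\partial w^{T}}{\partial w}
\end{aligned}
$$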

Where the result of the partial of w transpose with respect to w
is an (n x n) matrix, denoted as I, which has the following form:

$$
I = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & 0 & 0 & 1
\end{bmatrix}
$$


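With this in mind the final form of the second partial becomes:

$$
\begin{aligned}
\frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}
&= f_{out}\left(1 - f_{out}\right)^{2} f_{in} w^{T}
 - \left(f_{out}\right)^{2}\left(1 - f_{out}\right) f_{in} w^{T}
 + f_{out}\left(1 - f_{out}\right) I \\
&= f_{out}\left(1 - f_{out}\right)\left[\left(1 - f_{out}\right) - f_{out}\right] f_{in} w^{T}
 + f_{out}\left(1 - f_{out}\right) I
\end{aligned}
$$

$$
\frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}
= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) f_{in} w^{T}
+ f_{out}\left(1 - f_{out}\right) I \qquad \text{(C.6)}
$$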
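
The second order partials can be spot-checked in a few lines. The following sketch uses hypothetical function names and assumes the same single-cell sigmoid; it builds the matrix of second partials with respect to w and w transpose, and the mixed matrix with respect to w and f_in transpose, entry by entry.

```python
import math


def sigmoid_out(f_in, w):
    return 1.0 / (1.0 + math.exp(-sum(x * wi for x, wi in zip(f_in, w))))


def d2_dw_dwT(f_in, w):
    # f_out * (1 - f_out) * (1 - 2 f_out) * f_in f_in^T
    f = sigmoid_out(f_in, w)
    c = f * (1.0 - f) * (1.0 - 2.0 * f)
    return [[c * xk * xl for xl in f_in] for xk in f_in]


def d2_dw_dfinT(f_in, w):
    # f_out*(1 - f_out)*(1 - 2 f_out) * f_in w^T + f_out*(1 - f_out) * I
    f = sigmoid_out(f_in, w)
    c = f * (1.0 - f) * (1.0 - 2.0 * f)
    s = f * (1.0 - f)
    n = len(w)
    return [[c * f_in[k] * w[l] + (s if k == l else 0.0) for l in range(n)]
            for k in range(n)]
```

Note that when f_out = 0.5 the factor (1 - 2 f_out) vanishes, so the mixed partial reduces to 0.25 I and the pure second partial is zero.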


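In a similar fashion, the second partial of f_out with respect to
f_in and w transpose is found to be

$$
\frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial w^{T}}
= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) w\, f_{in}^{T}
+ f_{out}\left(1 - f_{out}\right) I
$$
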
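The final partials to consider are the partials of f_out with
respect to time and w, or

$$
\frac{\partial}{\partial t}\left[\frac{\partial f_{out}}{\partial w}\right]
= \frac{\partial}{\partial w}\left[\frac{\partial f_{out}}{\partial t}\right]
= \frac{\partial}{\partial w}\left[
\frac{\partial f_{out}}{\partial w^{T}}\,\frac{\partial w}{\partial t}
+ \frac{\partial f_{out}}{\partial f_{in}^{T}}\,\frac{\partial f_{in}}{\partial t}
\right]
$$

by applying Eq. C.3.

$$
\begin{aligned}
\frac{\partial}{\partial w}\left[\frac{\partial f_{out}}{\partial t}\right]
&= \frac{\partial^{2} f_{out}}{\partial w\,\partial w^{T}}\,\frac{\partial w}{\partial t}
 + \frac{\partial^{2} f_{out}}{\partial w\,\partial f_{in}^{T}}\,\frac{\partial f_{in}}{\partial t} \\
&= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) f_{in} f_{in}^{T}\,\frac{\partial w}{\partial t} \\
&\quad + f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) f_{in} w^{T}\,\frac{\partial f_{in}}{\partial t} \\
&\quad + f_{out}\left(1 - f_{out}\right)\frac{\partial f_{in}}{\partial t}
\end{aligned}
$$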

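Similarly,

$$
\begin{aligned}
\frac{\partial}{\partial f_{in}}\left[\frac{\partial f_{out}}{\partial t}\right]
&= \frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial w^{T}}\,\frac{\partial w}{\partial t}
 + \frac{\partial^{2} f_{out}}{\partial f_{in}\,\partial f_{in}^{T}}\,\frac{\partial f_{in}}{\partial t} \\
&= f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) w\, f_{in}^{T}\,\frac{\partial w}{\partial t} \\
&\quad + f_{out}\left(1 - f_{out}\right)\left(1 - 2 f_{out}\right) w\, w^{T}\,\frac{\partial f_{in}}{\partial t} \\
&\quad + f_{out}\left(1 - f_{out}\right)\frac{\partial w}{\partial t}
\end{aligned}
$$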


The  above  computations  provide  an  excellent  review,  as  well 
as  a  quick  reference  to  the  partial  derivatives  of  the  sigmoid 
function.  From  the  results,  it  is  readily  seen  why  the  sigmoid 
function  is  a  popular  nonlinear  transfer  function  used  by  the 
artificial neural network community. All of the partials of the
sigmoid  are  functions  of  the  sigmoid  itself.  This  makes 
computations  very  simple  and  convenient  for  a  digital  computer. 
Appendix D: Second Order Convergence Conditions for a Single Cell

The topic of this appendix concerns some ideas for improving
the network convergence to an optimum set of weights. More
specifically, the questions posed are: Can the network begin
its training routine with the second order algorithm? If so,
what criterion must be met to ensure that the network will
converge on an optimum set of weights? Is there a means of
improving the convergence times? The first section discusses the
initial state of the network. It introduces the criterion which
must be met in order to begin training with the second order
algorithm. The next and final section entertains the idea of
accelerating convergence with a momentum term. The following
text is the result of conversations and notes taken from
interviews with Dr. Mark Oxley [5].

D.1. Initialize Training with the Second Order Algorithm

To begin answering the above questions, a simple problem is
constructed for clarity. Consider a single layer perceptron that
classifies an analog input vector into two classes denoted 1 and
2; see Fig. D.1. The single cell is to divide the space spanned
by the input into two regions separated by a line in two
dimensions (or a hyperplane in higher dimensions). Class 1 will
be represented by a desired output of 1, while the desired output
for class 2 is 0. For training purposes it is desired to
minimize the squared error function with the second order
algorithm. To begin training with a second order algorithm, it
is desired for the initial weight settings to exist within the
basin of attraction of a global minimum of the squared error
performance surface. In this context, the global minimum is
defined over the entire training ensemble of input vectors. If
this criterion is not met, the path towards the optimum set of
weights may never be found by the second order algorithm.

Figure D.1 Pictorial Problem Description

Let

$$
s(w) = \frac{1}{k}\sum_{j=1}^{k}\left( d - f(x^{j}, w) \right)^{2} \qquad \text{(D.1)}
$$

where s(w) is the average squared error over all k input vectors.
For the problem considered here, w = (w_1, w_2, w_3) and x = (x_1,
x_2, 1). Keep in mind that the concept can be extended to a
network of higher dimensionality. By taking the first partial
derivative of s and evaluating at the optimum weights (w*), it is
desired that a minimum exists, and preferably equal to 0.
Assume that f(x^j, w) is the sigmoid function, so that the
results of appendix C may be used and


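$$
\begin{aligned}
\frac{\partial s(w)}{\partial w}
&= \frac{2}{k}\sum_{j=1}^{k}\left( d - f(x^{j}, w) \right)\cdot
   \frac{\partial}{\partial w}\left[ d - f(x^{j}, w) \right] \\
&= -\frac{2}{k}\sum_{j=1}^{k}\left( d - f(x^{j}, w) \right)\, f(x^{j}, w)\left( 1 - f(x^{j}, w) \right) x^{j}
   \qquad \text{(D.2)}
\end{aligned}
$$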


Where it is desired for

$$
\frac{\partial s(w^{*})}{\partial w} = 0,
$$

since the global minimum is also desired. To ensure a minimum
and not a saddle point, the second partial of s is considered.


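$$
\begin{aligned}
\frac{\partial^{2} s(w)}{\partial w\,\partial w^{T}}
&= \frac{2}{k}\sum_{j=1}^{k}\frac{\partial}{\partial w}\left(
   \left( d - f(x^{j}, w) \right)\frac{\partial}{\partial w^{T}}\left[ d - f(x^{j}, w) \right] \right) \\
&= \frac{2}{k}\sum_{j=1}^{k} f(x^{j}, w)\left( 1 - f(x^{j}, w) \right)
   \Big[ f(x^{j}, w)\left( 1 - f(x^{j}, w) \right) \\
&\qquad\qquad - \left( d - f(x^{j}, w) \right)\left( 1 - 2 f(x^{j}, w) \right) \Big]\,
   x^{j} (x^{j})^{T} \qquad \text{(D.3)}
\end{aligned}
$$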
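
The average squared error of the single cell, its gradient, and its matrix of second partials follow directly from the sigmoid partials of appendix C. The sketch below is a minimal illustration with hypothetical names; `patterns` is assumed to be a list of (x, d) pairs in which the last component of each x is the constant input 1.

```python
import math


def f(x, w):
    # Sigmoid output of the single cell for input pattern x.
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))


def avg_squared_error(patterns, w):
    # s(w) = (1/k) * sum_j (d - f(x^j, w))^2
    return sum((d - f(x, w)) ** 2 for x, d in patterns) / len(patterns)


def gradient(patterns, w):
    # ds/dw = -(2/k) * sum_j (d - f) * f * (1 - f) * x^j
    k = len(patterns)
    g = [0.0] * len(w)
    for x, d in patterns:
        out = f(x, w)
        scale = -2.0 / k * (d - out) * out * (1.0 - out)
        for i, xi in enumerate(x):
            g[i] += scale * xi
    return g


def hessian(patterns, w):
    # d2s/dw dw^T = (2/k) * sum_j f(1-f)[f(1-f) - (d-f)(1-2f)] x^j (x^j)^T
    k, n = len(patterns), len(w)
    h = [[0.0] * n for _ in range(n)]
    for x, d in patterns:
        out = f(x, w)
        slope = out * (1.0 - out)
        scale = 2.0 / k * slope * (slope - (d - out) * (1.0 - 2.0 * out))
        for i in range(n):
            for j in range(n):
                h[i][j] += scale * x[i] * x[j]
    return h
```

Each pattern contributes a scalar multiple of the outer product x^j (x^j)^T, which is why the definiteness of that outer product matters in the discussion that follows.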


The expression preceding the matrix is a scalar for given
values of x^j and w; therefore, the entire expression is in the
form of a matrix. If the expression is determined to be positive
definite, then the point at which the first partial of s
vanishes (the optimum weights) is a minimum of the error surface,
and ideally the error there is zero.

First, it must be determined whether x^j (x^j)^T is a positive
definite matrix. Each of the following is a necessary and
sufficient condition for the real symmetric matrix A to be
positive definite [14:243-254]:

(1) y^T A y > 0 for all nonzero vectors y.

(2) All the eigenvalues of A satisfy λ_i > 0.

(3) All the upper left submatrices A_k have positive determinants.

(4) All the pivots (without row exchanges) satisfy d_i > 0.

To test criterion (1), let A = x^j (x^j)^T and consider the
following:

$$
y^{T} A\, y = y^{T} x^{j} (x^{j})^{T} y
= \left( y^{T} x^{j} \right)\left( (x^{j})^{T} y \right)
= \left( y^{T} x^{j} \right)^{2} \ge 0.
$$

The above result shows that the matrix x^j (x^j)^T could be
positive definite or positive semi-definite (implying a 0
eigenvalue). Further investigation with criterion (2) is
necessary. If the matrix is singular, then a 0 eigenvalue
exists. To determine the singularity of a matrix, consider the
determinant of A.

Recall that x^j represents an arbitrary input pattern, where
each pattern will be considered a column vector. Each component
of a single pattern or vector is denoted as x_i. Considering a
specific input vector to ease the notational overhead, and
expanding the matrix yields:

$$
A = x\,x^{T} =
\begin{bmatrix}
x_1 x_1 & x_1 x_2 & x_1 \\
x_2 x_1 & x_2 x_2 & x_2 \\
x_1 & x_2 & 1
\end{bmatrix}
$$

and hence the determinant

$$
\begin{aligned}
|A| &= (x_1)^{2}\left( (x_2)^{2} - (x_2)^{2} \right)
 - x_1 x_2 \left( x_2 x_1 - x_2 x_1 \right)
 + x_1 \left( x_1 (x_2)^{2} - x_1 (x_2)^{2} \right) \\
&= 0.
\end{aligned}
$$
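
The singularity of the outer product is easy to confirm numerically. This is a small illustrative sketch with hypothetical helper names: it forms x x^T for a pattern of the form (x_1, x_2, 1), evaluates the 3 x 3 determinant by cofactor expansion along the first row, and evaluates the quadratic form y^T A y.

```python
def outer(x):
    # A = x x^T
    return [[xi * xj for xj in x] for xi in x]


def det3(m):
    # Cofactor expansion of a 3 x 3 determinant along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def quadratic_form(m, y):
    # y^T A y
    return sum(y[i] * m[i][j] * y[j] for i in range(3) for j in range(3))
```

Any nonzero y orthogonal to x gives y^T A y = 0, which is the semi-definite case described above; every other y gives a strictly positive value.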


Since the matrix is singular, it is positive semi-definite. At
this point, the test for a global minimum is inconclusive, since
it is entirely possible that the stationary point found at the
optimum weights is a saddle point. Therefore, it is
necessary to force the matrix to be positive definite and ensure
a global minimum by adding a scaled quadratic function of the
weights to s(w), such that

$$
S(w) = s(w) + \epsilon\,( w - w^{*} )^{T} ( w - w^{*} ), \qquad \epsilon > 0. \qquad \text{(D.4)}
$$

This particular quadratic was chosen such that when the first
partial derivative is taken and evaluated at w*, the partial of
the quadratic reduces to zero under ideal conditions. The second
partial of S(w) evaluated at the set of optimum weights (w*) is

$$
\frac{\partial^{2} S(w^{*})}{\partial w\,\partial w^{T}}
= \frac{\partial^{2} s(w^{*})}{\partial w\,\partial w^{T}} + 2\,\epsilon\, I > 0
$$

and positive definite if

$$
f(x^{j}, w)\left( 1 - f(x^{j}, w) \right)
- \left( d - f(x^{j}, w) \right)\left( 1 - 2 f(x^{j}, w) \right) > 0, \quad \text{for each } j.
$$

Two cases must be considered, d = 1 and d = 0. For d = 0, a new
function of the output may be described, such that

$$
G_0(f) = f\,( 2 - 3 f )
$$

for a particular input and a given set of weights. The graph of
G_0(f) is shown in Fig. D.2. Keep in mind that the values
of the sigmoid function are continuous over the range (0, 1).

This implies that G_0(f) is non-negative over 0 ≤ f ≤ 2/3.

When d = 1, the function takes on the following form:

$$
G_1(f) = -3 f^{2} + 4 f - 1
$$

The quadratic formula provides the points where the function
crosses the zero axis, f = 1/3 and f = 1, such that G_1(f) is
non-negative over the range 1/3 ≤ f ≤ 1. The graph of G_1(f) is
shown in Fig. D.3.

Figure D.3 Quadratic Function for d = 1


By overlapping the two graphs of Figs. D.2 and D.3, the
range of values f can assume is 1/3 ≤ f ≤ 2/3. If the output
meets this initial criterion, then the second partial of the
squared error is a positive definite matrix. The criterion
placed on the initial weight values, such that the training
begins in the neighborhood of the global minimum, can be found by
rewriting this inequality. This is accomplished below.
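
The two quadratics can be checked against the general sign condition f(1 - f) - (d - f)(1 - 2f) > 0. The following is a minimal sketch with hypothetical function names; it simply evaluates the general expression and the two special cases.

```python
def curvature_factor(f, d):
    # f*(1 - f) - (d - f)*(1 - 2*f); its sign decides positive definiteness.
    return f * (1.0 - f) - (d - f) * (1.0 - 2.0 * f)


def g0(f):
    # Special case d = 0: G0(f) = f*(2 - 3f), zero at f = 0 and f = 2/3.
    return f * (2.0 - 3.0 * f)


def g1(f):
    # Special case d = 1: G1(f) = -3f^2 + 4f - 1, zero at f = 1/3 and f = 1.
    return -3.0 * f * f + 4.0 * f - 1.0
```

Both functions are positive only for outputs between 1/3 and 2/3, for example at f = 0.5, while g0(0.8) and g1(0.2) are negative.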

Consider the range of values the output of the sigmoid
function may take on, in order to begin training within the
neighborhood of a global minimum, such that

$$
\frac{1}{3} \le f(x^{T} w) \le \frac{2}{3}
$$

$$
\frac{1}{3} \le \frac{1}{1 + \exp(-x^{T} w)} \le \frac{2}{3}
$$

$$
1 \le \frac{3}{1 + \exp(-x^{T} w)} \le 2
$$

$$
1 + \exp(-x^{T} w) \le 3 \le 2 + 2\exp(-x^{T} w).
$$

Consider the lower bound, where

$$
\exp(-x^{T} w) \le 2.
$$

The lower bound for the weighted sum of the inputs becomes:

$$
x^{T} w \ge -\ln(2).
$$

The upper bound is found in a similar fashion,

$$
\exp(-x^{T} w) \ge \frac{1}{2}
$$

$$
x^{T} w \le \ln(2).
$$

The final criterion becomes,

$$
-\ln(2) \le x^{T} w \le \ln(2).
$$

This relationship must hold over the entire input vector
ensemble. If this criterion is met, then it is ensured that
training will begin with a set of weight values in the
neighborhood of a global minimum over the entire input ensemble.
Therefore, when considering the entire input training set,

$$
-\ln(2) \le (x^{j})^{T} w \le \ln(2), \quad \forall\, j. \qquad \text{(D.5)}
$$

Results of this relationship suggest that the optimal weight
values are bounded in weight space by hyperplanes. These
hyperplanes are described from the above criterion, if x^T w
is allowed to equal the two extremes, -ln(2) and ln(2), for a
specific input. The result is a hypercube in weight space. The
ensemble of hypercubes over all input vectors approximates a
sphere in weight space enclosing the optimal weight values for
the corresponding set of input vectors.

By placing some restrictions on the magnitude of the input
vectors and then analyzing the criterion of Eq. D.5, the initial
range of weight values may be randomly chosen. For instance,
assume that the inputs have been normalized to lie in the
interval (-1, 1). Eq. D.5 may be restated in the following manner:

$$
-\ln(2) \le \sum_{i=1}^{n} w_i\, x_i \le \ln(2),
$$

where the sum ranges over all n inputs to the cell. With the above
inequality, consider a worst case condition: assume that the
inputs are all equal to 1 (or -1). This condition places a
further restriction on the initial value of the weights than that
imposed by the inequality of Eq. D.5, such that

$$
-\ln(2) \le \sum_{i=1}^{n} w_i \le \ln(2)
$$

and 


-ln(2  ) 


ln(2  ). 
The above inequality provides a strict criterion for the initial
weight values of each cell. For applications within this study,
the weights are randomly set. It would be a trivial exercise to
test the above criterion and reset the weights of those cells
which do not meet the inequality.
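
A check of this kind might look like the following sketch. It is illustrative only and was not part of the thesis software: the function names are hypothetical, the inputs are assumed to be normalized to [-1, 1], and the per-weight range of ±ln(2)/n is an assumption chosen here because it is sufficient to keep the weighted sum within ±ln(2).

```python
import math
import random

LN2 = math.log(2.0)


def meets_criterion(patterns, w):
    # Eq. D.5: -ln(2) <= (x^j)^T w <= ln(2) for every input pattern x^j.
    return all(abs(sum(xi * wi for xi, wi in zip(x, w))) <= LN2
               for x in patterns)


def draw_weights(n, rng):
    # Assumption: each weight is uniform in [-ln(2)/n, ln(2)/n].  With every
    # |x_i| <= 1, the triangle inequality then gives |x^T w| <= ln(2).
    bound = LN2 / n
    return [rng.uniform(-bound, bound) for _ in range(n)]
```

A cell whose randomly set weights fail `meets_criterion` could simply be redrawn with `draw_weights` until the test passes.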

Two important results should be observed from the discussion
above. First, if the criterion of Eq. D.6 is met, then the
inequality of Eq. D.5 is met. This implies that the second
partial derivative of the squared error is a positive definite
matrix. The existence of the positive definite matrix implies the
existence of a global minimum over the corresponding input set.
Furthermore, training begins within some neighborhood of the
global minimum. Secondly, randomly setting the weights to very
small, negative and positive, values provides a near zero value
for the sum of weighted inputs. This implies that all output
nodes fire, on average, near 0.5, allowing the network to train
and drive the outputs toward desired values. If the nodal
outputs are driven towards extremely low or high values
initially, the network has a very difficult task of driving the
outputs towards desired values in the opposite direction.

D.2.  The  Momentum  Term 

The weighted quadratic function added to the squared error
term is an attempt at driving the second partial matrix of the
squared error to a positive definite matrix. However, many
researchers desire to use this quadratic with the idea of
improving convergence times. For instance, consider Eq. D.4 as
the function to minimize during training. Reproducing Eq. D.4
provides,

$$
S(w) = \frac{1}{k}\sum_{j=1}^{k}\left( d - f(x^{j}, w) \right)^{2}
+ \epsilon\,( w - w^{*} )^{T} ( w - w^{*} ).
$$
Using a first order technique, the weights change according
to the following first order differential equation:

$$
\frac{\partial w}{\partial t}
= \alpha\,\frac{\partial}{\partial w}\left[ -\beta\, s + \epsilon\,( w - w^{*} )^{T} ( w - w^{*} ) \right].
$$

Lippmann describes the discrete counterpart of the above
differential equation [4:17]. By computing the partial, the
weights are updated by:

$$
\frac{\partial w}{\partial t}
= \frac{\alpha\,\beta}{k}\sum_{j=1}^{k}\left( d - f(x^{j}, w) \right) f(x^{j}, w)\left( 1 - f(x^{j}, w) \right) x^{j}
+ \alpha\,\epsilon\,( w - w^{*} ),
$$

where the constant factor of 2 produced by each differentiation
has been absorbed into α. The parameter β controls the
convergence rate. The last term, α ε (w - w*), is known as the
momentum term and was first introduced by Rumelhart, Hinton, and
Williams [13]. An article by Lippmann describes the momentum
term as possibly improving convergence times [4:17]. In the
derivation of the second order approximation, Parker introduces
the momentum term by means of leakage terms [7:593-600; 8].
Parker believes the leakage terms ensure convergence.
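
A discrete form of this update, for the single cell, might be sketched as follows. This is an illustration under stated assumptions rather than the thesis implementation: the names `rate` and `momentum` are hypothetical gains standing in for the products of the constants above, and w* is approximated by the previous weight vector, so that the momentum term becomes a multiple of the last weight change.

```python
import math


def cell(x, w):
    return 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))


def momentum_step(w, w_prev, patterns, rate, momentum):
    # patterns is a list of (x, d) pairs; rate and momentum are assumed gains.
    k = len(patterns)
    step = [0.0] * len(w)
    for x, d in patterns:
        out = cell(x, w)
        delta = (d - out) * out * (1.0 - out)   # error times sigmoid slope
        for i, xi in enumerate(x):
            step[i] += rate * delta * xi / k    # averaged first order term
    # Momentum term: a multiple of the previous weight change (w - w_prev).
    return [wi + si + momentum * (wi - pi)
            for wi, si, pi in zip(w, step, w_prev)]
```

With `momentum` set to zero the step reduces to plain gradient descent on the average squared error; with zero error the weights keep drifting in the direction of the previous change.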

In the last section, conditions were established such that
the network could begin training with a second order algorithm.
However, what happens to the positive definite matrix of the
second partial of S as the network trains? To maintain the
status of a positive definite matrix, there must be some
condition placed on ε. In this section, it is desired to find
the optimal momentum scalar, ε, ensuring that the search is
always being performed near the global minimum.

Consider Eq. D.4, which defines a new performance surface,
where

$$
S(w) = s(w) + \epsilon\,( w - w^{*} )^{T} ( w - w^{*} )
$$

and the average second partial of S becomes:

$$
\begin{aligned}
\frac{\partial^{2} S(w)}{\partial w\,\partial w^{T}}
= \frac{2}{k}\sum_{j=1}^{k} & f(x^{j}, w)\left( 1 - f(x^{j}, w) \right)
\Big[ f(x^{j}, w)\left( 1 - f(x^{j}, w) \right) \\
& - \left( d - f(x^{j}, w) \right)\left( 1 - 2 f(x^{j}, w) \right) \Big]\,
x^{j} (x^{j})^{T} + 2\,\epsilon\, I.
\end{aligned}
$$

Let

$$
a(x^{j}, w) = f(x^{j}, w)\left( 1 - f(x^{j}, w) \right)
\Big[ f(x^{j}, w)\left( 1 - f(x^{j}, w) \right)
- \left( d - f(x^{j}, w) \right)\left( 1 - 2 f(x^{j}, w) \right) \Big],
$$

then

$$
\frac{\partial^{2} S(w)}{\partial w\,\partial w^{T}}
= \frac{2}{k}\sum_{j=1}^{k} a(x^{j}, w)\, x^{j} (x^{j})^{T} + 2\,\epsilon\, I.
$$

Given an input pattern and set of weights, a becomes a scalar.
Its dependence on x^j and w will be dropped from the notation for
convenience.

To remain within the neighborhood of the global minimum, the
matrix above must maintain its positive definite condition.
Again, the test for a positive definite matrix is applied, such
that

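$$
y^{T}\left( a\, x^{j} (x^{j})^{T} + 2\,\epsilon\, I \right) y > 0,
$$

for all j = 1, 2, ..., k and for all nonzero vectors y. So

$$
\begin{aligned}
y^{T}\left( a\, x^{j} (x^{j})^{T} + 2\,\epsilon\, I \right) y
&= a\, y^{T} x^{j} (x^{j})^{T} y + 2\,\epsilon\, y^{T} y \\
&= a \left( y^{T} x^{j} \right)\left( (x^{j})^{T} y \right) + 2\,\epsilon\, y^{T} y \\
&= a \left( (x^{j})^{T} y \right)^{T}\left( (x^{j})^{T} y \right) + 2\,\epsilon\, y^{T} y,
\end{aligned}
$$

that is, we wish

$$
a \left( (x^{j})^{T} y \right)^{2} + 2\,\epsilon\, y^{T} y > 0. \qquad \text{(D.7)}
$$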

Two cases must be considered along the way to ensure that the
above inequality is met. First, given an a > 0 and an x^j, is
there a condition on ε such that the inequality is met? Yes,
ε > 0, where ε is independent of a and x^j. The second case is
not so trivial. For, given an a < 0 and an x^j, what condition is
placed on ε, where ε = ε(a, x^j)? Rewriting Eq. D.7,

$$
\epsilon > \frac{-a \left( (x^{j})^{T} y \right)^{2}}{2\, y^{T} y}. \qquad \text{(D.8)}
$$

Recall that y is a nonzero vector. Figures D.4 and D.5 are plots
of a as a function of the output (f) for values of d = 0 and
d = 1 respectively.

Figure D.5 a(f) for d = 1


For a value of ε independent of x^j and a, the most negative
value of a is desired. From Figs. D.4 and D.5 that value is
a = -0.06. It is also desired to have the maximum eigenvalue of
x^j (x^j)^T, since the quotient in Eq. D.8,

$$
\frac{-a \left( (x^{j})^{T} y \right)^{2}}{y^{T} y},
$$

is largest, for a < 0, when y is the eigenvector belonging to

$$
\lambda = \lambda_{max},
$$

where the eigenvalues are found by

$$
\left|\, x^{j} (x^{j})^{T} - \lambda\, I \,\right| = 0
$$

or

$$
\begin{vmatrix}
x_1 x_1 - \lambda & x_1 x_2 & x_1 \\
x_2 x_1 & x_2 x_2 - \lambda & x_2 \\
x_1 & x_2 & 1 - \lambda
\end{vmatrix} = 0
$$

for a particular input vector j. The above determinant produces
a cubic in λ, where

$$
\lambda^{2}\left( -\lambda + \left( x_1^{2} + x_2^{2} + 1 \right) \right) = 0
$$

and

$$
\lambda_{max} = x_1^{2} + x_2^{2} + 1.
$$

Finally, ε is estimated as:

$$
\epsilon > -\frac{1}{2k}\sum_{j=1}^{k} a(x^{j}, w)\left( (x_1^{j})^{2} + (x_2^{j})^{2} + 1 \right).
$$

This expression is for the specific problem described above in
section D.1. In general, while considering the minimum a
(-0.06), ε can be rewritten as:

$$
\epsilon > 0.06\cdot\frac{1}{2k}\sum_{j=1}^{k}\left[ \sum_{i=1}^{n} (x_i^{j})^{2} + 1 \right]. \qquad \text{(D.9)}
$$

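
The numbers in this bound can be reproduced with a short sketch. It is illustrative only: the function names are hypothetical, the scalar a is taken as f(1 - f)[f(1 - f) - (d - f)(1 - 2f)] from the second partial of S, and the averaging factor 1/(2k) is an assumption carried over from the factor of 2 in the denominator of Eq. D.8.

```python
def a_of(f, d):
    # a = f(1 - f) * [ f(1 - f) - (d - f)(1 - 2f) ]
    return f * (1.0 - f) * (f * (1.0 - f) - (d - f) * (1.0 - 2.0 * f))


def most_negative_a(steps=10000):
    # Scan the output range [0, 1] for both desired outputs d = 0 and d = 1.
    return min(a_of(i / steps, d) for d in (0.0, 1.0) for i in range(steps + 1))


def epsilon_bound(patterns, a_min=-0.06):
    # patterns hold the variable inputs only; the constant threshold input
    # contributes the "+ 1" inside the sum.
    k = len(patterns)
    total = sum(sum(xi * xi for xi in x) + 1.0 for x in patterns)
    return -a_min * total / (2.0 * k)
```

The scan in `most_negative_a` returns a value close to -0.06, in agreement with the figure read from the plots of a(f).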

From the above inequality, ε is a function of the sum of the
squares of the input components averaged over the entire ensemble
of input vectors. The input equal to 1 is analogous to a
threshold.

By adding the quadratic function of the weights to the
function (squared error) being minimized, a smoother error
surface results. It is believed the quadratic will (for lack of
a better expression) stretch out inflection and/or saddle points,
providing a smoother upward concavity. If the region in
proximity of the global minimum is relatively flat, the
quadratic will improve convergence times, again by providing an
upward concavity. Thus the quadratic removes areas within the
error surface which may slow down the converging process. The
expression for ε in Eq. D.9 ensures that this condition is
maintained throughout training.

One question remains to be answered. From the above
discussions, w* is the optimum vector of weights used to
determine the minimum of the error surface. So what value is
used to approximate w*? It is approximated by the best available
estimate, namely the previous weight values produced by the
weight update rule being used.

Appendix E: Further Comparisons with the Bayesian Classifier

This appendix provides an extended comparison of the neural
net classifier and Bayesian classifier. However, the criterion
for determining correct classification has been altered for the
neural net classifier. In chapter five, the node corresponding
to a correct choice had to fire above 0.8, while all other nodes
had to fire below 0.2. This criterion will be eased a bit to
provide a comparable analysis (if possible). Now, the node
corresponding to a correct classification must fire above 0.5 and
all other nodes below 0.5. Table E.1 provides the results for a
single pass through the network.
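
The two decision criteria differ only in their thresholds. A minimal sketch of the test, using a hypothetical function name and argument names, is:

```python
def is_correct(outputs, target, high=0.5, low=0.5):
    # The node at index `target` must fire above `high`;
    # every other node must fire below `low`.
    return all((o > high) if i == target else (o < low)
               for i, o in enumerate(outputs))
```

For example, outputs of (0.7, 0.3, 0.1) with the first node as the correct class pass the relaxed 0.5 criterion, but fail the chapter five criterion when called with `high=0.8, low=0.2`.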

Table E.1 Overall Classification Accuracy. (1) Gradient of
Steepest Descent, (2) Momentum Method, (3) Second Order Method,
(4) Bayesian.

                            Overall Accuracy
                  (1)            (2)       (3)       (4)
Training Data     [illegible]    91.9%     88.7%     74.6%
Testing Data      [illegible]    67.8%     72.4%     75.3%

The above table represents an instance during a typical
training session. Again, the neural net classifiers far exceed
the performance levels of the Bayesian classifier when the
training data is considered. However, the Bayesian classifier
has a sizable edge on the momentum method and a slight edge on
the second order method when regarding the test data. It is
likely that the neural net classifier would improve if the
amount of information were increased, for instance by increasing
the number of original input features. The neural net learns by
example: the more information the net has, the more opportunity
the net has to learn its environment.


Appendix F: XOR Model


with  t«xt_io; 
with  int«g«r_t9zt.io; 
with  float.t«zt_io; 
with  float.math.lib; 
with  systam; 


use  tazt.io; 
us«  intagar.tczt.io; 
usa  float.taxt.io; 
USA  float .math. lib; 


with  MITH.LIB.EXTENSION; 
with  VECTOR.OPERITIOMS; 
with  ■omp.sapport ; 

procedure  SO.XOR  is 


use  MiTH.LIB.EXTENSION; 
use  VECrOR.OPERATlONS ; 
use  somp.support ; 


nuffl. inputs 

nuffl. LI .nodes 

nuffl.L2.nodes 

il 

A2 

A3 

A4 

AS 


integer: 
integer ; 
integer: 
float ; 
float ; 
float ; 
float; 
float ; 


total. error  ;  float; 

Iteration. Count  :  integer 

Convergence. Count  :  integer 

Convergence. Criterion  :  constant 

interval  :  integer ; 

Center  :  float; 

Width  :  float ; 

total. cost  :  float; 


«  0; 

•  0; 

=  0.1; 


Seed 

begin  — Main 


system. unsigned.longword  :=  MATH.LIB.EXTENSION.get.seed 


put  ("Enter  center  of  random  weight  distribution:  ");  get  (center): 
skip.line; 

put  ("Enter  width  of  random  weight  distribution:  ");  get  (width); 
skip.line; 


put  ("Enter  number  of  inputs:  ");  get  (num.inputs) ;  skip.line; 
put  ("Enter  number  of  LI  nodes:  ");  get  (num.Ll.nodes) ;  skip.line; 
put  ("Enter  number  of  L2  nodes:  ") ;  get  (num.L2.nodes) ;  skip.line; 
put  ("Enter  interval:  ");  get  (interval);  skip.line; 


put 

put 

put 

put 

put 


C'Entar  th« 
("Enter  the 
("Enter  the 
("Enter  the 
("Enter  the 


constant 

11 

constant 

12 

constant 

13 

constant 

14 

constant 

15 

");  get  (il);  skip.line 
");  get  (12);  skip.line 
");  get  (13);  skip.line 
");  get  (14);  skip.line 
"):  get  (15);  skip.line 


declare 


LI  :  layer  (  num.inputs,  num.Ll.nodes  ); 

L2  :  layer  (  num.Ll.nodes,  num.L2.nodes  ): 
Dottt  :  vector  (  1  . .  num.L2.nodes  ) ; 


begin 

— Initialize  network  parameters 

for  j  in  LI .W*range(2)  loop 
for  i  in  LI .W ’range (1)  loop 

uniform  (  center,  width,  seed,  Ll.W(i,  j)  ); 
Ll.Del.W(i,  j)  :«  0.0; 
end  loop; 

uniform  (  center,  width,  seed,  Ll.Theta(j)  ); 
Ll.Del.Theta(j)  ;*  0,0; 
end  loop; 

for  j  in  L2.W’range(2)  loop 
for  i  in  L2.W’range(l)  loop 

uniform  (  center,  width,  seed,  L2.W(i,  j)  ); 
L2.Del.W(i,  j)  0.0; 
end  loop; 

uniform  (  center,  width,  seed,  L2.Theta(j)  ); 
L2.Del.Theta(j)  :»  0.0; 
end  loop; 

      while Convergence_Count /= 4 loop

         if Iteration_Count mod 2 = 0 then
            L1.Fin(1) := 0.1;
         else
            L1.Fin(1) := 0.9;
         end if;

         if Iteration_Count mod 4 < 2 then
            L1.Fin(2) := 0.1;
         else
            L1.Fin(2) := 0.9;
         end if;

         if Iteration_Count mod 4 = 0 or Iteration_Count mod 4 = 3 then
            Dout(1) := 0.1;
         else
            Dout(1) := 0.9;
         end if;

         forward_pass ( L1, A3, A5 );

         L2.Fin       := L1.Fout;
         L2.Fin_Prime := L1.Fout_Prime;

         forward_pass ( L2, A3, A5 );

         L2.Etotal := 2.0 * ( Dout - L2.Fout );

         for i in L2.Fout_Prime'range loop
            L2.Etotal_Prime(i) := -2.0 * L2.Fout_Prime(i);
         end loop;

         total_error := sum_output_error ( L2.Etotal );

         if abs ( total_error ) < Convergence_Criterion then
            Convergence_Count := Convergence_Count + 1;
         else
            Convergence_Count := 0;
         end if;

         backward_pass ( L2, A3, A5 );

         L1.Etotal       := compute_sum ( L2.Eout );
         L1.Etotal_Prime := compute_sum ( L2.Eout_Prime );

         backward_pass ( L1, A3, A5 );

         total_cost := 0.0;

         update_weights ( L2, total_cost, A1, A2, A4 );
         update_weights ( L1, total_cost, A1, A2, A4 );

         update_thresholds ( L2, total_cost, A1, A2, A4 );
         update_thresholds ( L1, total_cost, A1, A2, A4 );

         Iteration_Count := Iteration_Count + 1;

         if Iteration_Count mod interval = 0 then
            new_line;

            for i in L2.Fout'range loop
               put ("Error = "); put (L2.Etotal(i)/2.0); put ( " " );
            end loop;

            put ("Iteration = "); put (Iteration_Count); new_line(2);
         end if;

      end loop;

      put ("Iterations till Convergence = "); put (Iteration_Count);
   end;

end SO_XOR;
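
The training loop above never reads a data file: it derives each XOR pattern from the iteration counter alone. The following is a minimal, stand-alone sketch of that schedule, written in standard Ada rather than VAX Ada. The names XOR_Schedule, In1, In2, Target, Low and High are illustrative and do not appear in the listing; only the mod-2 and mod-4 tests and the 0.1/0.9 levels are taken from it.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure XOR_Schedule is
   --  Illustrative names; only the mod tests and the 0.1 / 0.9 levels
   --  are taken from the SO_XOR listing.
   Low  : constant Float := 0.1;
   High : constant Float := 0.9;

   package FIO is new Ada.Text_IO.Float_IO (Float);

   --  First input: alternates on every iteration.
   function In1 (Count : Natural) return Float is
   begin
      if Count mod 2 = 0 then return Low; else return High; end if;
   end In1;

   --  Second input: alternates every two iterations.
   function In2 (Count : Natural) return Float is
   begin
      if Count mod 4 < 2 then return Low; else return High; end if;
   end In2;

   --  Desired output: low when both inputs agree, high otherwise.
   function Target (Count : Natural) return Float is
   begin
      if Count mod 4 = 0 or Count mod 4 = 3 then
         return Low;
      else
         return High;
      end if;
   end Target;
begin
   --  One full cycle of four iterations covers the whole truth table.
   for Count in 0 .. 3 loop
      FIO.Put (In1 (Count), Fore => 1, Aft => 1, Exp => 0);
      Put (" ");
      FIO.Put (In2 (Count), Fore => 1, Aft => 1, Exp => 0);
      Put (" -> ");
      FIO.Put (Target (Count), Fore => 1, Aft => 1, Exp => 0);
      New_Line;
   end loop;
end XOR_Schedule;
```

Over iterations 0 through 3 the (input 1, input 2, target) triples are (0.1, 0.1, 0.1), (0.9, 0.1, 0.9), (0.1, 0.9, 0.9) and (0.9, 0.9, 0.1), which is the XOR truth table with 0.1 standing for false and 0.9 for true. The patterns are therefore presented in a fixed cyclic order, unlike the random selection used by the main model in Appendix G.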


Appendix G:  ADA Programming Model


--*  This program is a computer simulation of a biological-based
--*  neural network, applying a modified backward error
--*  propagation (BEP) algorithm.  This neural network model was
--*  developed for applications in pattern classification.  The
--*  modified BEP uses a minimization technique based on an
--*  approximation to a second order Newton's method.  This
--*  algorithm takes advantage of second order derivatives (of
--*  the surface to be minimized), as well as first order
--*  derivatives.  Time derivatives of the signals propagating
--*  through the network are also used in updating the network
--*  weights.  Below is the main procedure, Second Order
--*  Multilayer Perceptron (SOMP1.ADA), written in the ADA
--*  programming environment.
--*
--*  Model implemented by: Capt Clark Piazza, USAF


with system;
with text_io;             use text_io;
with float_text_io;       use float_text_io;
with integer_text_io;     use integer_text_io;
with float_math_lib;      use float_math_lib;
with somp_io;             use somp_io;
with somp_support;        use somp_support;
with vector_operations;   use vector_operations;
with math_lib_extension;  use math_lib_extension;

procedure somp1 is

   center          : float;
   width           : float;
   num_L1_nodes    : integer;
   num_L2_nodes    : integer;
   num_classes     : integer;
   num_tr_patterns : integer;
   num_te_patterns : integer;
   num_moments     : integer;
   A1              : float;
   A2              : float;
   A3              : float;
   A4              : float;
   A5              : float;

   -- Testing and training files containing pattern feature vectors.
   -- The name of the file containing the features must be of type
   -- string, 16 characters long.
   te_list : string ( 1 .. 16 );
   tr_list : string ( 1 .. 16 );

   seed : system.unsigned_longword := math_lib_extension.get_seed;

   output_error    : float;
   error_tolerance : float;
   total_cost      : float;
   total_distance  : float;
   avg_distance    : float;
   max1_index      : integer;
   max2_index      : integer;
   max_iterations  : integer;
   interval        : integer;
   num_iterations  : integer;
   te_count        : integer;
   tr_count        : integer;
   tr_num_correct  : float;
   te_num_correct  : float;

   tr_accuracy     : float;
   te_accuracy     : float;
   tot_tr_error    : float;
   tot_te_error    : float;
   avg_err_per_pat : float;
   num_passes      : integer := 1;
   tot_passes      : integer;
   num_points      : integer;

   convergence     : boolean;

   tr_error_file    : text_io.file_type;
   te_error_file    : text_io.file_type;
   tr_accuracy_file : text_io.file_type;
   te_accuracy_file : text_io.file_type;

-- Begin main procedure.
begin

   -- Enter the following from the terminal or create a com file.

   put ( "Enter center of random weight distribution (center), type float: " );
   get ( center ); skip_line;
   put ( "Enter width of random weight distribution (width), type float: " );
   get ( width ); skip_line;
   put ( "Enter number of layer one nodes (num_L1_nodes), type integer: " );
   get ( num_L1_nodes ); skip_line;
   put ( "Enter number of layer two nodes (num_L2_nodes), type integer: " );
   get ( num_L2_nodes ); skip_line;
   put ( "Enter number of output nodes (num_classes), type integer: " );
   get ( num_classes ); skip_line;
   put ( "Enter number of training patterns (num_tr_patterns), type integer: " );
   get ( num_tr_patterns ); skip_line;
   put ( "Enter number of testing patterns (num_te_patterns), type integer: " );
   get ( num_te_patterns ); skip_line;
   put ( "Enter number of moments per pattern (num_moments), type integer: " );
   get ( num_moments ); skip_line;
   put ( "Enter training moment data file (tr_list), type string: " );
   get ( tr_list ); skip_line;
   put ( "Enter testing moment data file (te_list), type string: " );
   get ( te_list ); skip_line;
   put ( "Enter number of separate training passes (tot_passes), type integer: " );
   get ( tot_passes ); skip_line;
   put ( "Enter maximum number of iterations (max_iterations), type integer: " );
   get ( max_iterations ); skip_line;
   put ( "Enter output interval to examine results, type integer: " );
   get ( interval ); skip_line;
   put ( "Enter error tolerance (error_tolerance), type float: " );
   get ( error_tolerance ); skip_line;

   -- Enter the desired learning parameters.  A1 controls convergence
   -- for first order method.  A2 and A3 control the amount of noise
   -- induced into the network.  A4 controls the amount of momentum.
   -- A5 is a convergence term controlling the second derivative
   -- information.

   put ( "Enter constant A1, type float: " ); get ( A1 ); skip_line;
   put ( "Enter constant A2, type float: " ); get ( A2 ); skip_line;
   put ( "Enter constant A3, type float: " ); get ( A3 ); skip_line;
   put ( "Enter constant A4, type float: " ); get ( A4 ); skip_line;
   put ( "Enter constant A5, type float: " ); get ( A5 ); skip_line;

   create ( tr_error_file, out_file, "tr_error.dat" );
   create ( te_error_file, out_file, "te_error.dat" );
   create ( tr_accuracy_file, out_file, "tr_accuracy.dat" );
   create ( te_accuracy_file, out_file, "te_accuracy.dat" );


   -- Declare network layer variables.
   declare

      L1 : layer ( num_moments, num_L1_nodes );
      L2 : layer ( num_L1_nodes, num_L2_nodes );
      L3 : layer ( num_L2_nodes, num_classes );

      Dout : vector ( 1 .. num_classes );

      training_array : matrix ( 1 .. num_tr_patterns, 1 .. num_moments + 1 );
      testing_array  : matrix ( 1 .. num_te_patterns, 1 .. num_moments + 1 );

      -- max_iterations must be some multiple of interval.
      data_points : integer := ( max_iterations / interval ) + 1;

      avg_tr_error : vector ( 1 .. data_points );
      avg_te_error : vector ( 1 .. data_points );
      avg_tr_acc   : vector ( 1 .. data_points );
      avg_te_acc   : vector ( 1 .. data_points );

      tr_acc_array : matrix ( 1 .. tot_passes, 1 .. data_points );
      te_acc_array : matrix ( 1 .. tot_passes, 1 .. data_points );
      tr_err_array : matrix ( 1 .. tot_passes, 1 .. data_points );
      te_err_array : matrix ( 1 .. tot_passes, 1 .. data_points );

      -- The parameters below may be used to measure the amount of change
      -- of the weights and thresholds, via cost.

      L1_weights    : constant natural := num_moments * num_L1_nodes;
      L1_thresholds : constant natural := num_L1_nodes;
      L2_weights    : constant natural := num_L1_nodes * num_L2_nodes;
      L2_thresholds : constant natural := num_L2_nodes;
      L3_weights    : constant natural := num_L2_nodes * num_classes;
      L3_thresholds : constant natural := num_classes;

      tot_parameters : constant float
         := float( L1_weights + L1_thresholds +
                   L2_weights + L2_thresholds +
                   L3_weights + L3_thresholds );

   -- Begin declare block.
   begin

      -- Get training and testing moments, store into an array.

      get_moment_array ( tr_list, num_moments, training_array );
      get_moment_array ( te_list, num_moments, testing_array );

      -- Initialize and train network a predetermined number of times
      -- and average network performance.

      while num_passes <= tot_passes loop

         -- Initialize network variables.
         initialize_network ( L1, L2, L3, center, width, seed );

         -- Begin training.

         num_points     := 1;
         num_iterations := 0;
         convergence    := false;

         while num_iterations <= max_iterations loop
         -- convergence = false and

            generate_random_moms ( num_tr_patterns, num_moments,
                                   training_array, L1.Fin, Dout, seed );

            -- Begin forward pass through network.
            compute_forward_pass ( L1, L2, L3, A3, A5 );

            -- Compute output error and time derivative of error for each
            -- output node.

            L3.Etotal := 2.0 * ( Dout - L3.Fout );

            for i in L3.Fout_Prime'range loop
               L3.Etotal_Prime(i) := -2.0 * L3.Fout_Prime(i);
            end loop;

            -- Compute sum of output errors, sum and average over all
            -- input patterns, and test with error tolerance for convergence.

            output_error := sum_output_error ( L3.Etotal );
            if num_iterations mod num_tr_patterns /= 0
               or num_iterations = 0 then
               avg_err_per_pat := avg_err_per_pat
                  + ( output_error / float( num_tr_patterns ) );
            elsif avg_err_per_pat < error_tolerance then
               convergence := true;
            else
               convergence := false;
            end if;

            -- Begin backward pass through the network one layer at a time.

            backward_pass ( L3, A3, A5 );

            L2.Etotal       := compute_sum ( L3.Eout );
            L2.Etotal_Prime := compute_sum ( L3.Eout_Prime );

            backward_pass ( L2, A3, A5 );

            L1.Etotal       := compute_sum ( L2.Eout );
            L1.Etotal_Prime := compute_sum ( L2.Eout_Prime );

            backward_pass ( L1, A3, A5 );

            total_cost := 0.0;

            update_weights ( L3, total_cost, A1, A2, A4 );
            update_weights ( L2, total_cost, A1, A2, A4 );
            update_weights ( L1, total_cost, A1, A2, A4 );

            update_thresholds ( L3, total_cost, A1, A2, A4 );
            update_thresholds ( L2, total_cost, A1, A2, A4 );
            update_thresholds ( L1, total_cost, A1, A2, A4 );

            -- Used to measure network parameter changes.

            -- avg_distance := sqrt( total_cost ) / tot_parameters;

            -- Compute network performance.

            if num_iterations mod interval = 0 then

               -- Check training data performance.

               tot_tr_error   := 0.0;
               tr_num_correct := 0.0;
               tr_count       := 1;

               while tr_count <= num_tr_patterns loop

                  generate_seq_moms ( tr_count, num_moments, training_array,
                                      L1.Fin, Dout );

                  compute_forward_pass ( L1, L2, L3, A3, A5 );

                  find_max_vals ( L3.Fout, max1_index, max2_index );

                  tr_num_correct := tr_num_correct
                     + float ( correct ( max1_index, max2_index,
                                         L3.Fout, Dout ) );

                  L3.Etotal    := 2.0 * ( Dout - L3.Fout );
                  output_error := sum_output_error ( L3.Etotal );

                  tot_tr_error := tot_tr_error + output_error;

                  tr_count := tr_count + 1;
               end loop;

               tr_acc_array ( num_passes, num_points )
                  := compute_ratio ( tr_num_correct, num_tr_patterns );

               tr_err_array ( num_passes, num_points )
                  := compute_ratio ( tot_tr_error, num_tr_patterns );

               -- Check test data performance.

               tot_te_error   := 0.0;
               te_num_correct := 0.0;
               te_count       := 1;

               while te_count <= num_te_patterns loop

                  generate_seq_moms ( te_count, num_moments, testing_array,
                                      L1.Fin, Dout );

                  compute_forward_pass ( L1, L2, L3, A3, A5 );
                  find_max_vals ( L3.Fout, max1_index, max2_index );
                  te_num_correct := te_num_correct
                     + float ( correct ( max1_index, max2_index,
                                         L3.Fout, Dout ) );

                  L3.Etotal    := 2.0 * ( Dout - L3.Fout );
                  output_error := sum_output_error ( L3.Etotal );

                  tot_te_error := tot_te_error + output_error;

                  te_count := te_count + 1;

               end loop;

               te_acc_array ( num_passes, num_points )
                  := compute_ratio ( te_num_correct, num_te_patterns );

               te_err_array ( num_passes, num_points )
                  := compute_ratio ( tot_te_error, num_te_patterns );

               num_points := num_points + 1;

            end if;

            num_iterations := num_iterations + 1;

         end loop;

         num_passes := num_passes + 1;
      end loop;

      -- Compute average network performance.

      avg_tr_error := compute_average ( tr_err_array, tot_passes );
      avg_te_error := compute_average ( te_err_array, tot_passes );
      avg_tr_acc   := compute_average ( tr_acc_array, tot_passes );
      avg_te_acc   := compute_average ( te_acc_array, tot_passes );

      -- Store average network performance in matrixX format.

      store_net_perf ( avg_tr_error, tr_error_file, interval );
      store_net_perf ( avg_te_error, te_error_file, interval );
      store_net_perf ( avg_tr_acc, tr_accuracy_file, interval );
      store_net_perf ( avg_te_acc, te_accuracy_file, interval );

      -- Close all files.

      close ( tr_error_file );
      close ( te_error_file );
      close ( tr_accuracy_file );
      close ( te_accuracy_file );

   -- End declare block.
   end;

-- End main procedure.
end somp1;
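
The main loop delegates the per-layer arithmetic to forward_pass, backward_pass, update_weights and update_thresholds in package somp_support. As a reading aid, the block below restates those computations as equations read directly off that listing. The symbols are notation introduced here, not the thesis's own: $f_i$ is Fin, $\dot f_i$ is Fin_Prime, $w_{ij}$ is W, $\Delta w_{ij}$ is Del_W, $\theta_j$ is Theta, $F_j$ is Fout, $E_j$ is Etotal, and $D_j$ is Dout. The weight-increment line is reconstructed from a badly scattered passage of the listing, so it should be treated as a best reading rather than a verified transcription.

```latex
\begin{aligned}
% forward_pass
x_j &= \sum_i f_i\, w_{ij}, \qquad
F_j = \sigma(x_j + \theta_j), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}} \\
v_j &= \sum_i \Bigl[ f_i \bigl( A_3 w_{ij} + A_5 \Delta w_{ij} \bigr)
        + \dot f_i\, w_{ij} \Bigr] + A_3 \theta_j + A_5 \Delta\theta_j \\
u_j &= F_j (1 - F_j), \qquad \dot F_j = v_j\, u_j \\[4pt]
% output layer error (main procedure)
E_j &= 2\,(D_j - F_j), \qquad \dot E_j = -2\, \dot F_j \\[4pt]
% backward_pass
q_j &= u_j E_j, \qquad
r_j = u_j \bigl[ \dot E_j + E_j (1 - 2F_j)\, v_j \bigr] \\
\mathrm{Eout}_{ij} &= q_j\, w_{ij}, \qquad
\dot{\mathrm{Eout}}_{ij} = r_j\, w_{ij}
        + q_j \bigl( A_3 w_{ij} + A_5 \Delta w_{ij} \bigr) \\
E_i^{\text{(previous layer)}} &= \sum_j \mathrm{Eout}_{ij}, \qquad
\dot E_i^{\text{(previous layer)}} = \sum_j \dot{\mathrm{Eout}}_{ij} \\[4pt]
% update_weights / update_thresholds
\Delta w_{ij} &\leftarrow (1 - A_4)\, \Delta w_{ij} - A_2 w_{ij}
        + (A_1 q_j + r_j)\, f_i + q_j\, \dot f_i, \qquad
w_{ij} \leftarrow w_{ij} + \Delta w_{ij} \\
\Delta\theta_j &\leftarrow (1 - A_4)\, \Delta\theta_j - A_2 \theta_j
        + A_1 q_j + r_j, \qquad
\theta_j \leftarrow \theta_j + \Delta\theta_j
\end{aligned}
```

Two consistency checks follow from the code itself. First, $u_j$ is the derivative of the sigmoid, since $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. Second, if $\dot F_j$ and $\dot E_j$ are read as time derivatives, then $\dot u_j = (1 - 2F_j)\, u_j v_j$, so $r_j = u_j \dot E_j + \dot u_j E_j$ is exactly the time derivative of $q_j = u_j E_j$. The threshold update has the same form as the weight update with the input fixed at 1 and its derivative at 0.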


with system;
with text_io;             use text_io;
with vector_operations;   use vector_operations;
with math_lib_extension;  use math_lib_extension;

package somp_support is

   type layer ( inputs : positive; outputs : positive ) is
      record

         -- Network Parameters

         Fin          : vector ( 1 .. inputs );
         Fin_Prime    : vector ( 1 .. inputs );
         W            : matrix ( 1 .. inputs, 1 .. outputs );
         Del_W        : matrix ( 1 .. inputs, 1 .. outputs );
         Theta        : vector ( 1 .. outputs );
         Del_Theta    : vector ( 1 .. outputs );
         Fout         : vector ( 1 .. outputs );
         Fout_Prime   : vector ( 1 .. outputs );
         Eout         : matrix ( 1 .. inputs, 1 .. outputs );
         Eout_Prime   : matrix ( 1 .. inputs, 1 .. outputs );
         Etotal       : vector ( 1 .. outputs );
         Etotal_Prime : vector ( 1 .. outputs );

         -- Temporary variables

         X : vector ( 1 .. outputs );
         V : vector ( 1 .. outputs );
         U : vector ( 1 .. outputs );
         Q : vector ( 1 .. outputs );
         R : vector ( 1 .. outputs );

      end record;

   function sigmoid ( input : vector ) return vector;

   function compute_sum ( input : matrix ) return vector;

   function sum_output_error ( input : vector ) return float;

   function correct ( index1, index2  : integer;
                      output, desired : vector ) return integer;

   function compute_ratio ( numerator   : float;
                            denominator : integer ) return float;

   function compute_average ( perf_array : matrix;
                              tot_passes : integer ) return vector;

   procedure initialize_network ( L1, L2, L3 : in out layer;
                                  center     : in float;
                                  width      : in float;
                                  seed       : in out system.unsigned_longword );

   procedure forward_pass ( L      : in out layer;
                            A3, A5 : in float );

   procedure compute_forward_pass ( L1, L2, L3 : in out layer;
                                    A3, A5     : in float );

   procedure backward_pass ( L      : in out layer;
                             A3, A5 : in float );

   procedure update_weights ( L          : in out layer;
                              cost       : in out float;
                              A1, A2, A4 : in float );

   procedure update_thresholds ( L          : in out layer;
                                 cost       : in out float;
                                 A1, A2, A4 : in float );

   procedure find_max_vals ( output : in vector;
                             index1 : in out integer;
                             index2 : in out integer );

end somp_support;

with float_math_lib; use float_math_lib;

package body somp_support is

   function sigmoid ( input : vector ) return vector is

      output : vector ( input'range );

   begin

      for i in input'range loop
         begin
            output(i) := 1.0 / ( 1.0 + exp ( -input(i) ) );
         exception
            when FLOOVEMAT => output(i) := 0.0;
         end;
      end loop;
      return output;
   end sigmoid;


   function compute_sum ( input : matrix ) return vector is
      total : vector ( input'range(1) ) := ( others => 0.0 );

   begin

      for i in input'range(1) loop
         for j in input'range(2) loop
            total(i) := total(i) + input(i, j);
         end loop;
      end loop;

      return total;

   end compute_sum;


   function sum_output_error ( input : vector ) return float is

      total : float := 0.0;

   begin

      for i in input'range loop
         total := ( total + abs( input(i) ) ) / 2.0;
      end loop;

      return total;

   end sum_output_error;


   function correct ( index1, index2  : integer;
                      output, desired : vector ) return integer is

      update : integer;

   begin

      if output(index1) >= 0.5 and output(index2) < 0.5
         and desired(index1) = 1.0 then
         update := 1;
      else
         update := 0;
      end if;

      return update;
   end correct;


   function compute_ratio ( numerator   : float;
                            denominator : integer ) return float is

      quotient : float;
   begin

      quotient := numerator / float(denominator);
      return quotient;
   end compute_ratio;


   function compute_average ( perf_array : matrix;
                              tot_passes : integer ) return vector is

      temp        : float;
      perf_vector : vector ( perf_array'range(2) );
   begin

      for j in perf_array'range(2) loop
         temp := 0.0;

         for i in perf_array'range(1) loop
            temp := temp + perf_array(i, j);
         end loop;

         perf_vector(j) := temp / float(tot_passes);
      end loop;

      return perf_vector;
   end compute_average;

   procedure initialize_network ( L1, L2, L3 : in out layer;
                                  center     : in float;
                                  width      : in float;
                                  seed       : in out system.unsigned_longword ) is

   begin

      for j in L1.W'range(2) loop
         for i in L1.W'range(1) loop
            uniform ( center, width, seed, L1.W(i, j) );
            L1.Del_W(i, j) := 0.0;
         end loop;

         uniform ( center, width, seed, L1.Theta(j) );
         L1.Del_Theta(j) := 0.0;
      end loop;

      for i in L1.Fin'range loop
         L1.Fin_Prime(i) := 0.0;
      end loop;

      for j in L2.W'range(2) loop
         for i in L2.W'range(1) loop
            uniform ( center, width, seed, L2.W(i, j) );
            L2.Del_W(i, j) := 0.0;
         end loop;

         uniform ( center, width, seed, L2.Theta(j) );
         L2.Del_Theta(j) := 0.0;
      end loop;

      for j in L3.W'range(2) loop
         for i in L3.W'range(1) loop
            uniform ( center, width, seed, L3.W(i, j) );
            L3.Del_W(i, j) := 0.0;
         end loop;

         uniform ( center, width, seed, L3.Theta(j) );
         L3.Del_Theta(j) := 0.0;
      end loop;

   end initialize_network;


   procedure forward_pass ( L      : in out layer;
                            A3, A5 : in float ) is

      Temp : float;
   begin

      L.X    := L.Fin * L.W;
      L.Fout := sigmoid ( L.X + L.Theta );
      L.V    := ( others => 0.0 );

      for j in L.W'range(2) loop
         for i in L.W'range(1) loop
            L.V(j) := L.V(j) + ( L.Fin(i) * ( A3 * L.W(i, j)
                                            + A5 * L.Del_W(i, j) ) )
                             + ( L.Fin_Prime(i) * L.W(i, j) );
         end loop;

         Temp := ( A3 * L.Theta(j) )
               + ( A5 * L.Del_Theta(j) );

         L.V(j) := L.V(j) + Temp;

         L.U(j) := L.Fout(j) * ( 1.0 - L.Fout(j) );

         L.Fout_Prime(j) := L.V(j) * L.U(j);
      end loop;

   end forward_pass;


   procedure compute_forward_pass ( L1, L2, L3 : in out layer;
                                    A3, A5     : in float ) is

   begin

      forward_pass ( L1, A3, A5 );

      L2.Fin       := L1.Fout;
      L2.Fin_Prime := L1.Fout_Prime;

      forward_pass ( L2, A3, A5 );

      L3.Fin       := L2.Fout;
      L3.Fin_Prime := L2.Fout_Prime;

      forward_pass ( L3, A3, A5 );

   end compute_forward_pass;


   procedure backward_pass ( L      : in out layer;
                             A3, A5 : in float ) is

   begin

      for j in L.W'range(2) loop

         L.Q(j) := L.U(j) * L.Etotal(j);

         L.R(j) := L.U(j) * ( L.Etotal_Prime(j) + ( L.Etotal(j)
                            * ( 1.0 - 2.0 * L.Fout(j) ) * L.V(j) ) );

         for i in L.W'range(1) loop

            L.Eout(i, j) := L.Q(j) * L.W(i, j);

            L.Eout_Prime(i, j) := ( L.R(j) * L.W(i, j) )
                                + ( L.Q(j) * ( A3 * L.W(i, j)
                                             + A5 * L.Del_W(i, j) ) );

         end loop;
      end loop;

   end backward_pass;


   procedure update_weights ( L          : in out layer;
                              cost       : in out float;
                              A1, A2, A4 : in float ) is

   begin

      for j in L.W'range(2) loop
         for i in L.W'range(1) loop

            L.Del_W(i, j) := ( ( 1.0 - A4 ) * L.Del_W(i, j) )
                           - ( A2 * L.W(i, j) )
                           + ( ( ( A1 * L.Q(j) ) + L.R(j) ) * L.Fin(i) )
                           + ( L.Q(j) * L.Fin_Prime(i) );

            L.W(i, j) := L.W(i, j) + L.Del_W(i, j);

            cost := cost + ( L.Del_W(i, j) ** 2 );

         end loop;
      end loop;

   end update_weights;


   procedure update_thresholds ( L          : in out layer;
                                 cost       : in out float;
                                 A1, A2, A4 : in float ) is

   begin

      for i in L.Theta'range loop

         L.Del_Theta(i) := ( ( 1.0 - A4 ) * L.Del_Theta(i) )
                         - ( A2 * L.Theta(i) )
                         + ( ( A1 * L.Q(i) ) + L.R(i) );

         L.Theta(i) := L.Theta(i) + L.Del_Theta(i);

         cost := cost + ( L.Theta(i) ** 2 );
      end loop;

   end update_thresholds;


   procedure find_max_vals ( output : in vector;
                             index1 : in out integer;
                             index2 : in out integer ) is

      max1  : float := 0.0;
      max2  : float := 0.0;
      temp1 : integer;
      temp2 : integer;

   begin

      for i in output'range loop
         if output(i) >= max1 then
            temp1 := i;
            max1  := output(i);
         end if;
      end loop;

      for i in output'range loop
         if output(i) >= max2 and output(i) < max1 then
            temp2 := i;
            max2  := output(i);
         end if;
      end loop;

      index1 := temp1;
      index2 := temp2;

   end find_max_vals;


end somp_support;

with system;
with text_io;             use text_io;
with float_text_io;       use float_text_io;
with integer_text_io;     use integer_text_io;
with vector_operations;   use vector_operations;
with math_lib_extension;  use math_lib_extension;

package somp_io is

   procedure get_moment_array ( image_list   : in string;
                                num_moments  : in integer;
                                moment_array : in out matrix );

   procedure get_feature_arrays ( num_features  : in integer;
                                  filename      : in out string;
                                  feature_array : in out matrix );

   procedure generate_random_moms ( num_patterns : in integer;
                                    num_moments  : in integer;
                                    moment_array : in matrix;
                                    input        : in out vector;
                                    Dout         : in out vector;
                                    seed         : in out system.unsigned_longword );

   procedure generate_seq_moms ( count        : in integer;
                                 num_moments  : in integer;
                                 moment_array : in matrix;
                                 input        : in out vector;
                                 Dout         : in out vector );

   procedure store_net_perf ( perf_vector : in vector;
                              perf_file   : in out text_io.file_type;
                              interval    : in integer );

end somp_io;

package body somp_io is

   procedure get_moment_array ( image_list   : in string;
                                num_moments  : in integer;
                                moment_array : in out matrix ) is

      initial    : integer := 1;
      final      : integer := 0;
      temp_class : integer;
      num_vector : integer;
      image_name : string ( 1 .. 14 );
      image_file : text_io.file_type;
      temp_name  : text_io.file_type;

   begin

      open ( image_file, in_file, image_list );

      while not end_of_file ( image_file ) loop
         get ( image_file, image_name );
         open ( temp_name, in_file, image_name );
         get ( temp_name, temp_class );
         get ( temp_name, num_vector );

         final := final + num_vector;

         for i in initial .. final loop
            skip_line ( temp_name, 3 );
            for j in 2 .. ( num_moments + 1 ) loop
               moment_array(i, 1) := float(temp_class);
               get ( temp_name, moment_array(i, j) );
            end loop;
         end loop;

         close ( temp_name );
         initial := initial + num_vector;
      end loop;

      close ( image_file );
   end get_moment_array;


   procedure get_feature_arrays ( num_features  : in integer;
                                  filename      : in out string;
                                  feature_array : in out matrix ) is

      counter    : integer := 1;
      temp_class : integer;
      target     : string ( 1 .. 4 );
      input_file : text_io.file_type;
   begin

      open ( input_file, in_file, filename );
      while not end_of_file ( input_file ) loop

         get ( input_file, temp_class );
         get ( input_file, target );
         skip_line ( input_file );

         feature_array(counter, 1) := float( temp_class );

         for j in 2 .. ( num_features + 1 ) loop
            get ( input_file, feature_array(counter, j) );
         end loop;

         skip_line ( input_file, 2 );
         counter := counter + 1;

      end loop;

      close ( input_file );
   end get_feature_arrays;

   procedure generate_random_moms ( num_patterns : in integer;
                                    num_moments  : in integer;
                                    moment_array : in matrix;
                                    input        : in out vector;
                                    Dout         : in out vector;
                                    seed         : in out system.unsigned_longword ) is

      pick       : float;
      temp_class : integer;
      choice     : integer;

   begin

      Dout := ( others => 0.0 );
      uniform ( 0.5, 0.5, seed, pick );

      choice := integer ( pick * float(num_patterns) + 0.5 );
      if choice < 1 then
         choice := 1;
      elsif choice > num_patterns then
         choice := num_patterns;
      end if;

      temp_class := integer ( moment_array(choice, 1) );

      for j in 2 .. ( num_moments + 1 ) loop
         input(j - 1) := moment_array(choice, j);
      end loop;

      Dout(temp_class) := 1.0;
   end generate_random_moms;


   procedure generate_seq_moms ( count        : in integer;
                                 num_moments  : in integer;
                                 moment_array : in matrix;
                                 input        : in out vector;
                                 Dout         : in out vector ) is

      temp_class : integer;
   begin

      Dout := ( others => 0.0 );

      temp_class := integer ( moment_array(count, 1) );

      for j in 2 .. ( num_moments + 1 ) loop
         input(j - 1) := moment_array(count, j);
      end loop;

      Dout(temp_class) := 1.0;

   end generate_seq_moms;


   procedure store_net_perf ( perf_vector : in vector;
                              perf_file   : in out text_io.file_type;
                              interval    : in integer ) is

      temp_interval : integer := 0;

   begin

      put ( perf_file, "x = [" );
      new_line ( perf_file );

      for i in perf_vector'range loop
         put ( perf_file, temp_interval );
         new_line ( perf_file );

         temp_interval := temp_interval + interval;
      end loop;

      put ( perf_file, "]" );
      new_line ( perf_file );
      put ( perf_file, "y = [" );
      new_line ( perf_file );

      for i in perf_vector'range loop
         put ( perf_file, perf_vector(i) );
         new_line ( perf_file );
      end loop;

      put ( perf_file, "]" );
   end store_net_perf;

end somp_io;
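
Procedure generate_random_moms turns a uniform sample into a training-pattern index by scaling, adding 0.5, converting to integer, and clamping. The sketch below isolates that step so the mapping can be checked with fixed sample values. It is written in standard Ada, and Pick_Sketch and Pick_Index are illustrative names that do not appear in the listing. It also assumes the sample lies in [0, 1), which is what uniform ( 0.5, 0.5, ... ) yields if the underlying generator returns values in [0, 1).

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Pick_Sketch is
   --  Map a uniform sample Pick in [0, 1) to a pattern index in 1 .. N,
   --  mirroring the scale / round / clamp steps of generate_random_moms.
   --  Pick_Index is an illustrative name, not part of the thesis code.
   function Pick_Index (Pick : Float; N : Positive) return Positive is
      --  Float-to-integer conversion rounds to the nearest integer.
      Choice : Integer := Integer (Pick * Float (N) + 0.5);
   begin
      if Choice < 1 then
         Choice := 1;          --  guard the low edge
      elsif Choice > N then
         Choice := N;          --  guard the high edge
      end if;
      return Choice;
   end Pick_Index;
begin
   --  With N = 4: 0.10 scales to 0.9, 0.40 to 2.1, 0.95 to 4.3.
   Put_Line (Integer'Image (Pick_Index (0.10, 4)));
   Put_Line (Integer'Image (Pick_Index (0.40, 4)));
   Put_Line (Integer'Image (Pick_Index (0.95, 4)));
end Pick_Sketch;
```

The three calls select patterns 1, 2 and 4. Adding 0.5 before rounding gives each index an interval of equal width 1/N in the sample, and the clamp only matters at the extreme ends of the range, where the rounding of an exact half is implementation-sensitive in older Ada compilers.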


Math_Lib_Extension


-- Purpose: Provides access to pseudorandom number
--          generators for both uniform and Gaussian distributions.
-- Inputs:  1) See individual routines.
-- Outputs: 1) See individual routines.
-- Author:  Dennis W. Ruck (GE-87D), AFIT/ENG


with system; use system;

package math_lib_extension is

   type time_array is new unsigned_word_array (1..7);

   procedure mth_random ( val : out float; seed : in out unsigned_longword );

   pragma INTERFACE ( vaxrtl, mth_random );

   pragma IMPORT_VALUED_PROCEDURE ( mth_random, "MTH$RANDOM",
                                    mechanism => (value, reference) );

   procedure uniform ( center : in float;
                       width  : in float;
                       seed   : in out unsigned_longword;
                       val    : out float );

   procedure gaussian ( mean     : in float;
                        variance : in float;
                        seed     : in out unsigned_longword;
                        val      : out float );

   function get_seed return unsigned_longword;

end math_lib_extension;

Math_Lib_Extension (package body)


-- Purpose: Package to provide pseudorandom number generators
--          with uniform and Gaussian distributions.
-- Inputs:  1) See the individual routines.
-- Outputs: 1) See the individual routines.
-- Author:  Dennis W. Ruck (GE-87D), AFIT/ENG


with starlet;
with condition_handling;
with float_math_lib; use float_math_lib;


package body math_lib_extension is

   -- This procedure will return a uniform random sample centered
   -- about CENTER plus or minus WIDTH.
   procedure uniform ( center : in float;
                       width  : in float;
                       seed   : in out unsigned_longword;
                       val    : out float ) is

      X : float;

   begin
      -- get uniform random variable between 0 and 1
      mth_random ( x, seed );

      -- adjust to center +/- width
      X := X * width * 2.0;
      X := X + center - width;

      val := X;
   end uniform;

   -- This procedure will return a gaussian random variable sample
   -- with mean MEAN and variance of VARIANCE.  The central limit theorem
   -- is invoked to approximate the gaussian with a sum of uniform RVs.

   procedure gaussian ( mean     : in float;
                        variance : in float;
                        seed     : in out unsigned_longword;
                        val      : out float ) is

      num_rvs : constant := 20;
      sum     : float := 0.0;
      X       : float;
      Z       : float;
      Y       : float;
      ave     : float;
      norm    : float;

   begin

      -- Obtain a sum of random variables that are uniform between
      -- 0 and 1.

      for i in 1..num_rvs loop
         mth_random ( x, seed );
         sum := sum + x;
      end loop;

      ave := sum / float(num_rvs);

      -- AVE is a rv with mean = 0.5 and variance = 1/(12*num_rvs);
      -- now normalize AVE

      Z := (ave-0.5)/sqrt(1.0/(12.0*float(num_rvs)));

      -- Now unnormalize to desired mean and variance
      Y := mean + sqrt(variance)*Z;

      val := Y;

   end gaussian;

   function get_seed return unsigned_longword is

      -- Returns the lower unsigned longword of the binary representation
      -- of the system time as the initial seed for the pseudo-random
      -- number generator.

      status : condition_handling.cond_value_type;
      bintim : unsigned_quadword;

   begin
      STARLET.gettim ( status, bintim );
      return bintim.L0;
   end get_seed;

end math_lib_extension;
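
The central limit construction used by the gaussian procedure can be tried outside the VAX environment with the short program below. This is only a sketch under stated assumptions: it substitutes the standard Ada.Numerics.Float_Random generator for the VAX run-time routine called in the listing, and the names Gaussian_Demo and Samples, the fixed seed, and the requested mean and variance are all hypothetical choices made for illustration.

```ada
with Ada.Text_IO;
with Ada.Numerics.Float_Random;
with Ada.Numerics.Elementary_Functions;

procedure Gaussian_Demo is
   use Ada.Numerics.Float_Random;
   use Ada.Numerics.Elementary_Functions;

   Num_RVs : constant := 20;      -- same count as in the listing
   Samples : constant := 10_000;  -- hypothetical number of test draws

   Gen : Generator;  -- assumption: stands in for the VAX generator

   --  Average Num_RVs uniform (0,1) draws, normalize the average to
   --  zero mean and unit variance, then scale and shift.
   function Gaussian (Mean, Variance : Float) return Float is
      Sum : Float := 0.0;
      Ave : Float;
      Z   : Float;
   begin
      for I in 1 .. Num_RVs loop
         Sum := Sum + Random (Gen);
      end loop;
      Ave := Sum / Float (Num_RVs);
      --  Ave has mean 0.5 and variance 1/(12*Num_RVs)
      Z := (Ave - 0.5) / Sqrt (1.0 / (12.0 * Float (Num_RVs)));
      return Mean + Sqrt (Variance) * Z;
   end Gaussian;

   Total    : Float := 0.0;
   Total_Sq : Float := 0.0;
   V        : Float;
   M        : Float;
begin
   Reset (Gen, 42);  -- fixed seed (assumption) so runs are repeatable
   for K in 1 .. Samples loop
      V        := Gaussian (Mean => 3.0, Variance => 4.0);
      Total    := Total + V;
      Total_Sq := Total_Sq + V * V;
   end loop;
   M := Total / Float (Samples);
   Ada.Text_IO.Put_Line ("sample mean     =" & Float'Image (M));
   Ada.Text_IO.Put_Line
     ("sample variance =" & Float'Image (Total_Sq / Float (Samples) - M * M));
end Gaussian_Demo;
```

With these settings the printed sample mean and variance should come out close to the requested 3.0 and 4.0. Because only 20 uniform samples are averaged, no value can fall more than about 7.7 standard deviations from the mean, so the tails are truncated relative to a true gaussian.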




Vector_Operations (package spec)

-- Purpose: Provide general vector operations to allow
--   a more readable implementation of equations consisting
--   of one and two dimensional arrays.
-- Inputs:  1) See individual routines.
-- Outputs: 1) See individual routines.
-- Author: Dennis W. Ruck (GE-87D), AFIT/ENG
-- Modified By: Charles C. Piazza (GE-88D), AFIT/ENG

with text_io;

package vector_operations is

   type vector is array ( integer range <> ) of FLOAT;
   type matrix is array ( integer range <>, integer range <> ) of FLOAT;

   function "*" ( left  : vector;
                  right : matrix ) return vector;

   function "*" ( left, right : vector ) return float;

   function "*" ( left  : float;
                  right : matrix ) return matrix;

   function "*" ( left  : float;
                  right : vector ) return vector;

   function "+" ( left, right : matrix ) return matrix;

   function "+" ( left, right : vector ) return vector;

   function "-" ( left, right : matrix ) return matrix;

   function "-" ( left, right : vector ) return vector;

   function distance ( left, right : vector ) return float;

   procedure put ( output : in text_io.file_type;
                   data   : in vector );

   procedure put ( data : in matrix );

   procedure get ( input : in text_io.file_type;
                   data  : out vector );

end vector_operations;


Vector_Operations (package body)

-- Purpose: Provide generic vector operations to allow a more
--   readable implementation of equations consisting of one and
--   two dimensional arrays.
-- Inputs:  1) See individual routines.
-- Outputs: 1) See individual routines.
-- Author: Dennis W. Ruck (GE-87D), AFIT/ENG
-- Modified By: Charles C. Piazza (GE-88D), AFIT/ENG


with text_io; use text_io;
with float_text_io; use float_text_io;
with float_math_lib; use float_math_lib;

package body vector_operations is

   -- Row vector times matrix.
   function "*" ( left  : vector;
                  right : matrix ) return vector is

      sum     : FLOAT;
      product : vector ( right'range(2) );

   begin
      for j in right'range(2) loop
         sum := 0.0;
         for i in right'range(1) loop
            sum := sum + left (i) * right (i,j);
         end loop;
         product (j) := sum;
      end loop;

      return product;
   end "*";

   -- Inner (dot) product of two vectors.
   function "*" ( left, right : vector ) return FLOAT is

      sum : FLOAT := 0.0;

   begin
      for i in left'range loop
         sum := sum + left (i) * right (i);
      end loop;

      return sum;
   end "*";


   -- Scalar times matrix.
   function "*" ( left  : float;
                  right : matrix ) return matrix is

      product : matrix ( right'range(1), right'range(2) );

   begin
      for i in right'range(1) loop
         for j in right'range(2) loop
            product (i, j) := left * right (i, j);
         end loop;
      end loop;

      return product;
   end "*";


   -- Scalar times vector.
   function "*" ( left  : float;
                  right : vector ) return vector is

      product : vector ( right'range );

   begin
      for i in right'range loop
         product (i) := left * right (i);
      end loop;

      return product;
   end "*";

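   -- Element-by-element matrix sum.
   function "+" ( left, right : matrix ) return matrix is

      sum : matrix ( left'range(1), left'range(2) );

   begin
      -- i runs over the first dimension and j over the second, so
      -- that non-square matrices are indexed within their bounds.
      for i in left'range(1) loop
         for j in left'range(2) loop
            sum (i, j) := left (i, j) + right (i, j);
         end loop;
      end loop;

      return sum;
   end "+";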

   -- Element-by-element vector sum.
   function "+" ( left, right : vector ) return vector is

      sum : vector ( left'range );

   begin
      for i in left'range loop
         sum (i) := left (i) + right (i);
      end loop;

      return sum;
   end "+";


   -- Element-by-element matrix difference.
   function "-" ( left, right : matrix ) return matrix is

      diff : matrix ( left'range(1), left'range(2) );

   begin
      -- i runs over the first dimension and j over the second, so
      -- that non-square matrices are indexed within their bounds.
      for i in left'range(1) loop
         for j in left'range(2) loop
            diff (i, j) := left (i, j) - right (i, j);
         end loop;
      end loop;

      return diff;
   end "-";

   -- Element-by-element vector difference.
   function "-" ( left, right : vector ) return vector is

      diff : vector ( left'range );

   begin
      for i in left'range loop
         diff (i) := left (i) - right (i);
      end loop;

      return diff;
   end "-";


   -- Euclidean distance between two vectors.
   function distance ( left, right : vector ) return float is

      sum_x2 : float := 0.0;

   begin
      for j in left'range loop
         sum_x2 := sum_x2 + ( left (j) - right (j) ) * ( left (j) - right (j) );
      end loop;

      return sqrt ( sum_x2 );
   end distance;


   -- Write a vector to a file, wrapping lines at col_max columns.
   procedure put ( output : in text_io.file_type;
                   data   : in vector ) is

      col_max : constant := 72;
      width   : constant := 10;
      col     : positive := 1;

   begin
      for j in data'range loop
         put ( output, data (j), 0, 6, 1 );
         put ( output, " " );
         col := col + width;
         if col > col_max then
            new_line ( output );
            col := 1;
         end if;
      end loop;
   end put;

   -- Write a matrix to standard output, one row at a time.
   procedure put ( data : in matrix ) is

      col_max : constant := 72;
      width   : constant := 10;
      col     : positive := 1;

   begin
      for i in data'range(1) loop
         for j in data'range(2) loop
            put ( data (i,j), 1, 4, 0 );
            put ( " " );
            col := col + width;
            if col > col_max then
               new_line;
               col := 1;
            end if;
         end loop;
         new_line;
         col := 1;
      end loop;
   end put;


   -- Read a vector from a file, one element at a time.
   procedure get ( input : in text_io.file_type;
                   data  : out vector ) is

   begin
      for j in data'range loop
         get ( input, data (j) );
      end loop;
   end get;

end vector_operations;
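
The stated purpose of the package is to let array equations be written in a readable form. The sketch below illustrates this with the vector-times-matrix operator. It is a self-contained illustration rather than part of the thesis software: the types and the operator are redeclared locally so the program compiles on its own, and the procedure name Vector_Demo and the data values (two inputs feeding three nodes) are hypothetical.

```ada
with Ada.Text_IO;

procedure Vector_Demo is
   type Vector is array (Integer range <>) of Float;
   type Matrix is array (Integer range <>, Integer range <>) of Float;

   --  Row vector times matrix, following the first "*" of the listing.
   function "*" (Left : Vector; Right : Matrix) return Vector is
      Product : Vector (Right'Range (2));
      Sum     : Float;
   begin
      for J in Right'Range (2) loop
         Sum := 0.0;
         for I in Right'Range (1) loop
            Sum := Sum + Left (I) * Right (I, J);
         end loop;
         Product (J) := Sum;
      end loop;
      return Product;
   end "*";

   --  Hypothetical data: two inputs and a 2 x 3 weight matrix.
   X : constant Vector (1 .. 2) := (1.0, 2.0);
   W : constant Matrix (1 .. 2, 1 .. 3) :=
     ((0.5, -1.0, 0.0),
      (0.25, 1.0, 2.0));
   Y : Vector (1 .. 3);
begin
   Y := X * W;  --  reads like the equation y = xW
   for J in Y'Range loop
      Ada.Text_IO.Put_Line (Float'Image (Y (J)));
   end loop;
end Vector_Demo;
```

For this data the three components of Y are 1.0, 1.0 and 4.0. The overloaded operator keeps the calling code as a single assignment in place of an explicit double loop at each use.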